Super strong! The top ten machine learning algorithms you must know
1. Linear regression
Linear regression is the simplest and most widely used method for predictive modeling One of the machine learning algorithms.
It is a supervised learning algorithm used to predict the value of a dependent variable based on one or more independent variables.
Definition
The core of linear regression is to fit a linear model based on observed data.
The linear model is represented by the following equation:
where
- is the dependent variable (The variable we want to predict)
- is the independent variable (the variable we use to predict)
- is the slope of the straight line
- is the y-axis intercept (the intersection of the straight line and the y-axis)
The linear regression algorithm involves finding the best path through the data points Fitting line. This is usually done by minimizing the squared difference between the observed and predicted values.
Evaluation Metrics
- Mean Square Error (MSE): The average of the squares of the measurement errors. The lower the value, the better.
- R-squared: Indicates the percentage of variation in the dependent variable that can be predicted from the independent variables. The closer to 1 the better.
from sklearn.datasets import load_diabetesfrom sklearn.model_selection import train_test_splitfrom sklearn.linear_model import LinearRegressionfrom sklearn.metrics import mean_squared_error, r2_score# Load the Diabetes datasetdiabetes = load_diabetes()X, y = diabetes.data, diabetes.target# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Creating and training the Linear Regression modelmodel = LinearRegression()model.fit(X_train, y_train)# Predicting the test set resultsy_pred = model.predict(X_test)# Evaluating the modelmse = mean_squared_error(y_test, y_pred)r2 = r2_score(y_test, y_pred)print("MSE is:", mse)print("R2 score is:", r2)
2. Logistic regression
Logistic regression is used for classification problems. It predicts the probability that a given data point belongs to a certain category, such as yes/no or 0/1.
Evaluation indicators
- Accuracy: Accuracy is the number of correctly predicted observations and the total number of observations The ratio.
- Precision and Recall: Precision is the ratio of correctly predicted positive observations to all expected positive observations. Recall is the ratio of correctly predicted positive observations to all actual observations.
- F1 Score: The balance between recall and precision.
from sklearn.datasets import load_breast_cancerfrom sklearn.linear_model import LogisticRegressionfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score# Load the Breast Cancer datasetbreast_cancer = load_breast_cancer()X, y = breast_cancer.data, breast_cancer.target# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Creating and training the Logistic Regression modelmodel = LogisticRegression(max_iter=10000)model.fit(X_train, y_train)# Predicting the test set resultsy_pred = model.predict(X_test)# Evaluating the modelaccuracy = accuracy_score(y_test, y_pred)precision = precision_score(y_test, y_pred)recall = recall_score(y_test, y_pred)f1 = f1_score(y_test, y_pred)# Print the resultsprint("Accuracy:", accuracy)print("Precision:", precision)print("Recall:", recall)print("F1 Score:", f1)
3. Decision tree
Decision tree is a versatile and powerful machine learning algorithm. Can be used for classification and regression tasks.
They are popular for their simplicity, interpretability, and ability to handle both numerical and categorical data.
Definition
A decision tree consists of nodes representing decision points, branches representing possible outcomes, and leaves representing the final decision or prediction.
Each node in the decision tree corresponds to a feature, and the branches represent the possible values of the feature.
The algorithm for building a decision tree involves recursively splitting a data set into subsets based on the values of different features. The goal is to create homogeneous subsets where the target variable (the variable we want to predict) is similar in each subset.
The splitting process continues until stopping criteria are met, such as maximum depth, minimum number of samples, or no further improvements can be made.
Evaluation metrics
- For classification: accuracy, precision, recall and F1 score
- For regression: mean square error (MSE), R-squared
from sklearn.datasets import load_winefrom sklearn.tree import DecisionTreeClassifier# Load the Wine datasetwine = load_wine()X, y = wine.data, wine.target# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Creating and training the Decision Tree modelmodel = DecisionTreeClassifier(random_state=42)model.fit(X_train, y_train)# Predicting the test set resultsy_pred = model.predict(X_test)# Evaluating the modelaccuracy = accuracy_score(y_test, y_pred)precision = precision_score(y_test, y_pred, average='macro')recall = recall_score(y_test, y_pred, average='macro')f1 = f1_score(y_test, y_pred, average='macro')# Print the resultsprint("Accuracy:", accuracy)print("Precision:", precision)print("Recall:", recall)print("F1 Score:", f1)
4. Naive Bayes
Naive Bayes classifiers are a family of simple "probabilistic classifiers" that use Bayes' theorem and the assumption of strong (naive) independence between features. It is especially used for text classification.
It calculates the probability of each class and the conditional probability of each class given each input value. These probabilities are then used to classify new values based on the highest probability.
Evaluation metrics:
- Accuracy: measures the overall correctness of the model.
- Precision, Recall and F1 Score: Especially important when the class distribution is imbalanced.
from sklearn.datasets import load_digitsfrom sklearn.naive_bayes import GaussianNB# Load the Digits datasetdigits = load_digits()X, y = digits.data, digits.target# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Creating and training the Naive Bayes modelmodel = GaussianNB()model.fit(X_train, y_train)# Predicting the test set resultsy_pred = model.predict(X_test)# Evaluating the modelaccuracy = accuracy_score(y_test, y_pred)precision = precision_score(y_test, y_pred, average='macro')recall = recall_score(y_test, y_pred, average='macro')f1 = f1_score(y_test, y_pred, average='macro')# Print the resultsprint("Accuracy:", accuracy)print("Precision:", precision)print("Recall:", recall)print("F1 Score:", f1)
5.K-最近邻(KNN)
K 最近邻 (KNN) 是一种简单直观的机器学习算法,用于分类和回归任务。
它根据输入数据点与其在特征空间中最近邻居的相似性进行预测。
在 KNN 中,新数据点的预测由其 k 个最近邻的多数类(用于分类)或平均值(用于回归)确定。KNN 中的 “k” 表示要考虑的邻居数量,这是用户选择的超参数。
算法
KNN 算法包括以下步骤
- 计算距离:计算新数据点与数据集中所有其他数据点之间的距离。
- 查找邻居:根据计算的距离选择 k 个最近邻居。
- 多数投票或平均:对于分类,分配 k 个邻居中出现最频繁的类标签。对于回归,计算 k 个邻居的目标变量的平均值。
- 进行预测:将预测的类标签或值分配给新数据点。
评估指标
- 「分类」:准确率、精确率、召回率、F1 分数。
- 「回归」:均方误差 (MSE)、R 平方。
from sklearn.datasets import load_winefrom sklearn.model_selection import train_test_splitfrom sklearn.neighbors import KNeighborsClassifierfrom sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score# Load the Wine datasetwine = load_wine()X, y = wine.data, wine.target# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Creating and training the KNN modelknn_model = KNeighborsClassifier(n_neighbors=3)knn_model.fit(X_train, y_train)# Predicting the test set resultsy_pred_knn = knn_model.predict(X_test)# Evaluating the modelaccuracy_knn = accuracy_score(y_test, y_pred_knn)precision_knn = precision_score(y_test, y_pred_knn, average='macro')recall_knn = recall_score(y_test, y_pred_knn, average='macro')f1_knn = f1_score(y_test, y_pred_knn, average='macro')# Print the resultsprint("Accuracy:", accuracy_knn)print("Precision:", precision_knn)print("Recall:", recall_knn)print("F1 Score:", f1_knn)
6.SVM
支持向量机 (SVM) 是一种强大的监督学习算法,用于分类和回归任务。
它们在高维空间中特别有效,广泛应用于图像分类、文本分类和生物信息学等各个领域。
算法原理
支持向量机的工作原理是找到最能将数据分为不同类别的超平面。
选择超平面以最大化边距,即超平面与每个类的最近数据点(支持向量)之间的距离。
SVM 还可以通过使用核函数将输入空间转换为可以线性分离的高维空间来处理非线性数据。
训练 SVM 的算法包括以下步骤:
- 数据准备:预处理数据并根据需要对分类变量进行编码。
- 选择核:选择合适的核函数,例如线性、多项式或径向基函数 (RBF)。
- 模型训练:通过寻找使类之间的间隔最大化的超平面来训练 SVM。
- 模型评估:使用交叉验证或保留验证集评估 SVM 的性能。
评估指标
- 「分类」:准确率、精确率、召回率、F1 分数。
- 「回归」:均方误差 (MSE)、R 平方。
from sklearn.svm import SVCbreast_cancer = load_breast_cancer()X, y = breast_cancer.data, breast_cancer.target# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Creating and training the SVM modelsvm_model = SVC()svm_model.fit(X_train, y_train)# Predicting the test set resultsy_pred_svm = svm_model.predict(X_test)# Evaluating the modelaccuracy_svm = accuracy_score(y_test, y_pred_svm)precision_svm = precision_score(y_test, y_pred_svm, average='macro')recall_svm = recall_score(y_test, y_pred_svm, average='macro')f1_svm = f1_score(y_test, y_pred_svm, average='macro')accuracy_svm, precision_svm, recall_svm, f1_svm# Print the resultsprint("Accuracy:", accuracy_svm)print("Precision:", precision_svm)print("Recall:", recall_svm)print("F1 Score:", f1_svm)
7.随机森林
随机森林是一种集成学习技术,它结合了多个决策树来提高预测性能并减少过度拟合。
它们广泛用于分类和回归任务,并以其鲁棒性和多功能性而闻名。
算法步骤
随机森林是根据数据集的随机子集并使用特征的随机子集进行训练的决策树的集合。
森林中的每棵决策树独立地进行预测,最终的预测是通过聚合所有树的预测来确定的。
构建随机森林的算法包括以下步骤
- 随机采样:从数据集中随机选择样本子集(带替换)来训练每棵树。
- 特征随机化:随机选择每个节点的特征子集以考虑分割。
- 树构建:使用采样数据和特征构建多个决策树。
- 投票或平均:聚合所有树的预测以做出最终预测。
评估指标
- 分类:准确率、精确率、召回率、F1 分数。
- 回归:均方误差 (MSE)、R 平方。
from sklearn.ensemble import RandomForestClassifierbreast_cancer = load_breast_cancer()X, y = breast_cancer.data, breast_cancer.target# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Creating and training the Random Forest modelrf_model = RandomForestClassifier(random_state=42)rf_model.fit(X_train, y_train)# Predicting the test set resultsy_pred_rf = rf_model.predict(X_test)# Evaluating the modelaccuracy_rf = accuracy_score(y_test, y_pred_rf)precision_rf = precision_score(y_test, y_pred_rf, average='macro')recall_rf = recall_score(y_test, y_pred_rf, average='macro')f1_rf = f1_score(y_test, y_pred_rf, average='macro')# Print the resultsprint("Accuracy:", accuracy)print("Precision:", precision)print("Recall:", recall)print("F1 Score:", f1)
8.K-均值聚类
K 均值聚类是一种无监督学习算法,用于将数据分组为 “K” 个聚类。确定 k 个质心后,每个数据点被分配到最近的簇。
该算法将数据点分配给一个簇,使得数据点与簇质心之间的平方距离之和最小。
评估指标
- 「惯性」:样本到最近聚类中心的总平方距离称为惯性。值越低越好。
- 「Silhouette Score」:表示一个项目属于其自身集群的紧密程度。高轮廓分数意味着该项目与其自身的集群匹配良好,而与附近的集群匹配不佳。轮廓得分从 -1 到 1。
from sklearn.datasets import load_irisfrom sklearn.cluster import KMeansfrom sklearn.metrics import silhouette_score# Load the Iris datasetiris = load_iris()X = iris.data# Applying K-Means Clusteringkmeans = KMeans(n_clusters=3, random_state=42)kmeans.fit(X)# Predicting the cluster for each data pointy_pred_clusters = kmeans.predict(X)# Evaluating the modelinertia = kmeans.inertia_silhouette = silhouette_score(X, y_pred_clusters)print("Inertia:", inertia)print("Silhouette:", silhouette)
9.PCA
降维是通过使用主成分分析 (PCA) 来完成的。它将数据转换为新的坐标系,减少变量数量,同时尽可能多地保留原始数据的变化。
使用 PCA 可以找到使数据方差最大化的主要成分或轴。第一个主成分捕获最大方差,第二个主成分(与第一个主成分正交)捕获第二大方差,依此类推。
评估指标
- 「解释方差」:表示每个主成分捕获的数据方差有多少。
- 「总解释方差」:由所选主成分解释的累积方差。
from sklearn.datasets import load_breast_cancerfrom sklearn.decomposition import PCAimport numpy as np# Load the Breast Cancer datasetbreast_cancer = load_breast_cancer()X = breast_cancer.data# Applying PCApca = PCA(n_compnotallow=2)# Reducing to 2 dimensions for simplicitypca.fit(X)# Transforming the dataX_pca = pca.transform(X)# Explained Varianceexplained_variance = pca.explained_variance_ratio_# Total Explained Variancetotal_explained_variance = np.sum(explained_variance)print("Explained variance:", explained_variance)print("Total Explained Variance:", total_explained_variance)
10.梯度提升算法
梯度提升是一种先进的机器学习技术。它依次构建多个弱预测模型(通常是决策树)。每个新模型都逐渐最小化整个模型的损失函数(误差)。
评估指标
- 「对于分类」:准确率、精确率、召回率、F1 分数。
- 「对于回归」:均方误差 (MSE)、R 平方。
from sklearn.datasets import load_diabetesfrom sklearn.ensemble import GradientBoostingRegressorfrom sklearn.metrics import mean_squared_error, r2_score# Load the Diabetes datasetdiabetes = load_diabetes()X, y = diabetes.data, diabetes.target# Splitting the dataset into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Creating and training the Gradient Boosting modelgb_model = GradientBoostingRegressor(random_state=42)gb_model.fit(X_train, y_train)# Predicting the test set resultsy_pred_gb = gb_model.predict(X_test)# Evaluating the modelmse_gb = mean_squared_error(y_test, y_pred_gb)r2_gb = r2_score(y_test, y_pred_gb)print("MSE:", mse_gb)
The above is the detailed content of Super strong! The top ten machine learning algorithms you must know. For more information, please follow other related articles on the PHP Chinese website!

Generative AI, exemplified by chatbots like ChatGPT, offers project managers powerful tools to streamline workflows and ensure projects stay on schedule and within budget. However, effective use hinges on crafting the right prompts. Precise, detail

The challenge of defining Artificial General Intelligence (AGI) is significant. Claims of AGI progress often lack a clear benchmark, with definitions tailored to fit pre-determined research directions. This article explores a novel approach to defin

IBM Watsonx.data: Streamlining the Enterprise AI Data Stack IBM positions watsonx.data as a pivotal platform for enterprises aiming to accelerate the delivery of precise and scalable generative AI solutions. This is achieved by simplifying the compl

The rapid advancements in robotics, fueled by breakthroughs in AI and materials science, are poised to usher in a new era of humanoid robots. For years, industrial automation has been the primary focus, but the capabilities of robots are rapidly exp

The biggest update of Netflix interface in a decade: smarter, more personalized, embracing diverse content Netflix announced its largest revamp of its user interface in a decade, not only a new look, but also adds more information about each show, and introduces smarter AI search tools that can understand vague concepts such as "ambient" and more flexible structures to better demonstrate the company's interest in emerging video games, live events, sports events and other new types of content. To keep up with the trend, the new vertical video component on mobile will make it easier for fans to scroll through trailers and clips, watch the full show or share content with others. This reminds you of the infinite scrolling and very successful short video website Ti

The growing discussion of general intelligence (AGI) in artificial intelligence has prompted many to think about what happens when artificial intelligence surpasses human intelligence. Whether this moment is close or far away depends on who you ask, but I don’t think it’s the most important milestone we should focus on. Which earlier AI milestones will affect everyone? What milestones have been achieved? Here are three things I think have happened. Artificial intelligence surpasses human weaknesses In the 2022 movie "Social Dilemma", Tristan Harris of the Center for Humane Technology pointed out that artificial intelligence has surpassed human weaknesses. What does this mean? This means that artificial intelligence has been able to use humans

TransUnion's CTO, Ranganath Achanta, spearheaded a significant technological transformation since joining the company following its Neustar acquisition in late 2021. His leadership of over 7,000 associates across various departments has focused on u

Building trust is paramount for successful AI adoption in business. This is especially true given the human element within business processes. Employees, like anyone else, harbor concerns about AI and its implementation. Deloitte researchers are sc


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Atom editor mac version download
The most popular open source editor

SublimeText3 English version
Recommended: Win version, supports code prompts!

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.
