機器學習模型選擇。-Python教學-PHP中文網

首頁

後端開發

Python教學

機器學習模型選擇。

Barbara Streisand

Sep 25, 2024 am 06:30 AM

ML Model Selection.

一、簡介

在本文中，我們將學習如何在具有不同超參數的多個模型之間選擇最佳模型，在某些情況下，我們可以擁有50 多個不同的模型，了解如何選擇一個模型對於為您的資料集獲得最佳效能的模型非常重要.

我們將透過選擇最佳學習演算法及其最佳超參數來進行模型選擇。

但是首先什麼是超參數？這些是用戶設定的附加設置，將影響模型學習其參數的方式。參數另一方面是模型在訓練過程中學習的內容。

2. 使用窮舉搜尋。

窮舉搜尋涉及透過搜尋一系列超參數來選擇最佳模型。為此，我們利用 scikit-learn 的 GridSearchCV.

GridSearchCV 的工作原理：

使用者為一個或多個超參數定義一組可能的值。
GridSearchCV 使用每個值和/或值的組合來訓練模型。
表現最佳的模型被選為最佳模型。

範例
我們可以設定邏輯迴歸作為我們的學習演算法並調整兩個超參數（C 和正規化懲罰）。我們也可以指定兩個參數：求解器和最大迭代次數。

現在，對於 C 和正規化懲罰值的每種組合，我們訓練模型並使用 k 折交叉驗證對其進行評估。
因為我們有 10 個可能的 C 值，所以有 2 個可能的 reg 值。懲罰和 5 倍，我們總共有 (10 x 2 x 5 = 100) 個候選模型，從中選擇最好的。

# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression(max_iter=500, solver='liblinear')

# Create range of candidate penalty hyperparameter values
penalty = ['l1','l2']

# Create range of candidate regularization hyperparameter values
C = np.logspace(0, 4, 10)

# Create dictionary of hyperparameter candidates
hyperparameters = dict(C=C, penalty=penalty)

# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)

# Fit grid search
best_model = gridsearch.fit(features, target)

# Show the best model
print(best_model.best_estimator_)

# LogisticRegression(C=7.742636826811269, max_iter=500, penalty='l1',
solver='liblinear') # Result

得到最佳模型：

# View best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])

# Best Penalty: l1 #Result
# Best C: 7.742636826811269 # Result

3. 使用隨機搜尋。

當您想要一種比窮舉搜尋更便宜的計算方法來選擇最佳模型時，通常會使用這種方法。

值得注意的是，RandomizedSearchCV 基本上並不比 GridSearchCV 更快，但它通常只需透過測試較少的組合即可在更短的時間內實現與 GridSearchCV 相當的效能。

RandomizedSearchCV 的工作原理:

使用者將提供超參數/分佈（例如常態、均勻）。
演算法將隨機搜尋給定超參數值的特定數量的隨機組合，而不進行替換。

範例

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression(max_iter=500, solver='liblinear')

# Create range of candidate regularization penalty hyperparameter values
penalty = ['l1', 'l2']

# Create distribution of candidate regularization hyperparameter values
C = uniform(loc=0, scale=4)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

# Create randomized search
randomizedsearch = RandomizedSearchCV(
logistic, hyperparameters, random_state=1, n_iter=100, cv=5, verbose=0,
n_jobs=-1)

# Fit randomized search
best_model = randomizedsearch.fit(features, target)

# Print best model
print(best_model.best_estimator_)

# LogisticRegression(C=1.668088018810296, max_iter=500, penalty='l1',
solver='liblinear') #Result.

得到最佳模型：

# View best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])

# Best Penalty: l1 # Result
# Best C: 1.668088018810296 # Result

注意：訓練的候選模型數量在n_iter（迭代次數）設定中指定。

4. 從多種學習演算法中選擇最佳模型。

在這一部分中，我們將了解如何透過搜尋一系列學習演算法及其各自的超參數來選擇最佳模型。

我們可以透過簡單地建立候選學習演算法及其超參數的字典來用作 GridSearchCV.

的搜尋空間來做到這一點

步驟：

我們可以定義一個包含兩種學習演算法的搜尋空間。
我們指定超參數，並使用格式分類器[超參數名稱]_定義它們的候選值。

# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Set random seed
np.random.seed(0)

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])

# Create dictionary with candidate learning algorithms and their hyperparameters
search_space = [{"classifier": [LogisticRegression(max_iter=500,
solver='liblinear')],
"classifier__penalty": ['l1', 'l2'],
"classifier__C": np.logspace(0, 4, 10)},
{"classifier": [RandomForestClassifier()],
"classifier__n_estimators": [10, 100, 1000],
"classifier__max_features": [1, 2, 3]}]

# Create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)

# Fit grid search
best_model = gridsearch.fit(features, target)

# Print best model
print(best_model.best_estimator_)

# Pipeline(steps=[('classifier',
                 LogisticRegression(C=7.742636826811269, max_iter=500,
                      penalty='l1', solver='liblinear'))])

最佳模特兒：
搜尋完成後，我們可以使用best_estimator_查看最佳模型的學習演算法和超參數。

5. 預處理時選擇最佳模型。

有時我們可能希望在模型選擇過程中包含預處理步驟。
最好的解決方案是建立一個包含預處理步驟及其任何參數的管道：

第一個挑戰：
GridSeachCv 使用交叉驗證來確定效能最高的模型。

然而，在交叉驗證中，我們假裝未看到測試集時保留的折疊，因此不屬於任何預處理步驟（例如縮放或標準化）。

因此，預處理步驟必須是 GridSearchCV 所採取的操作集的一部分。

解
Scikit-learn 提供了 FeatureUnion，它允許我們正確組合多個預處理操作。
步驟：

We use _FeatureUnion _to combine two preprocessing steps: standardize the feature values(StandardScaler) and principal component analysis(PCA) - this object is called the preprocess and contains both of our preprocessing steps.
Next we include preprocess in our pipeline with our learning algorithm.

This allows us to outsource the proper handling of fitting, transforming, and training the models with combinations of hyperparameters to scikit-learn.

Second Challenge:
Some preprocessing methods such as PCA have their own parameters, dimensionality reduction using PCA requires the user to define the number of principal components to use to produce the transformed features set. Ideally we would choose the number of components that produces a model with the greatest performance for some evaluation test metric.
Solution.
In scikit-learn when we include candidate component values in the search space, they are treated like any other hyperparameter to be searched over.

# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Set random seed
np.random.seed(0)

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a preprocessing object that includes StandardScaler features and PCA
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())])

# Create a pipeline
pipe = Pipeline([("preprocess", preprocess),
               ("classifier", LogisticRegression(max_iter=1000,
               solver='liblinear'))])

# Create space of candidate values
search_space = [{"preprocess__pca__n_components": [1, 2, 3],
"classifier__penalty": ["l1", "l2"],
"classifier__C": np.logspace(0, 4, 10)}]

# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)

# Fit grid search
best_model = clf.fit(features, target)

# Print best model
print(best_model.best_estimator_)

# Pipeline(steps=[('preprocess',
     FeatureUnion(transformer_list=[('std', StandardScaler()),
                                    ('pca', PCA(n_components=1))])),
    ('classifier',
    LogisticRegression(C=7.742636826811269, max_iter=1000,
                      penalty='l1', solver='liblinear'))]) # Result

After the model selection is complete we can view the preprocessing values that produced the best model.

Preprocessing steps that produced the best modes

# View best n_components

best_model.best_estimator_.get_params() 
# ['preprocess__pca__n_components'] # Results

5. Speeding Up Model Selection with Parallelization.

That time you need to reduce the time it takes to select a model.
We can do this by training multiple models simultaneously, this is done by using all the cores in our machine by setting n_jobs=-1

# Load libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create logistic regression
logistic = linear_model.LogisticRegression(max_iter=500, 
                                           solver='liblinear')

# Create range of candidate regularization penalty hyperparameter values
penalty = ["l1", "l2"]

# Create range of candidate values for C
C = np.logspace(0, 4, 1000)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

# Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, 
                             verbose=1)

# Fit grid search
best_model = gridsearch.fit(features, target)

# Print best model
print(best_model.best_estimator_)

# Fitting 5 folds for each of 2000 candidates, totalling 10000 fits
# LogisticRegression(C=5.926151812475554, max_iter=500, penalty='l1',
                                                  solver='liblinear')

6. Speeding Up Model Selection ( Algorithm Specific Methods).

This a way to speed up model selection without using additional compute power.

This is possible because scikit-learn has model-specific cross-validation hyperparameter tuning.

Sometimes the characteristics of a learning algorithms allows us to search for the best hyperparameters significantly faster.

Example:
LogisticRegression is used to conduct a standard logistic regression classifier.
LogisticRegressionCV implements an efficient cross-validated logistic regression classifier that can identify the optimum value of the hyperparameter C.

# Load libraries
from sklearn import linear_model, datasets

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create cross-validated logistic regression
logit = linear_model.LogisticRegressionCV(Cs=100, max_iter=500,
                                            solver='liblinear')

# Train model
logit.fit(features, target)

# Print model
print(logit)

# LogisticRegressionCV(Cs=100, max_iter=500, solver='liblinear')

Note:A major downside to LogisticRegressionCV is that it can only search a range of values for C. This limitation is common to many of scikit-learn's model-specific cross-validated approaches.

I hope this Article was helpful in creating a quick overview of how to select a machine learning model.

以上是機器學習模型選擇。的詳細內容。更多資訊請關注PHP中文網其他相關文章！

陳述

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn

Python vs.C：申請和用例Apr 12, 2025 am 12:01 AM

Python适合数据科学、Web开发和自动化任务，而C 适用于系统编程、游戏开发和嵌入式系统。Python以简洁和强大的生态系统著称，C 则以高性能和底层控制能力闻名。

2小時的Python計劃：一種現實的方法Apr 11, 2025 am 12:04 AM

2小時內可以學會Python的基本編程概念和技能。 1.學習變量和數據類型，2.掌握控制流（條件語句和循環），3.理解函數的定義和使用，4.通過簡單示例和代碼片段快速上手Python編程。

Python：探索其主要應用程序Apr 10, 2025 am 09:41 AM

Python在web開發、數據科學、機器學習、自動化和腳本編寫等領域有廣泛應用。 1)在web開發中，Django和Flask框架簡化了開發過程。 2)數據科學和機器學習領域，NumPy、Pandas、Scikit-learn和TensorFlow庫提供了強大支持。 3)自動化和腳本編寫方面，Python適用於自動化測試和系統管理等任務。