
Comparison of hyperparameter optimization: grid search, random search and Bayesian optimization


This article will introduce in detail the most common hyperparameter optimization methods used to improve machine learning results.

A common reflex when trying to improve machine learning results is to add more training data. Additional data is often helpful (except in certain circumstances), but generating high-quality data can be very expensive. Hyperparameter optimization saves us time and resources by using existing data to get the best model performance.

As the name suggests, hyperparameter optimization is the process of determining the best combination of hyperparameters for a machine learning model so as to satisfy its optimization function (i.e., maximize the model's performance on the data set under study). In other words, each model provides a number of tuning "knobs" that we can adjust until we reach an optimal combination of hyperparameters. Examples of hyperparameters we can change during optimization include the learning rate, the architecture of a neural network (e.g., the number of hidden layers), regularization, and so on.
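
To make these "knobs" concrete, here is a minimal sketch (using scikit-learn's MLPClassifier, which is purely illustrative and not used in the rest of this article) of hyperparameters being set before training, as opposed to the model parameters learned from data:

from sklearn.neural_network import MLPClassifier

# Hyperparameters are chosen by us before training: the architecture,
# the regularization strength, and the learning rate
model = MLPClassifier(hidden_layer_sizes=(64, 32),  # two hidden layers
                      alpha=1e-4,                   # L2 regularization strength
                      learning_rate_init=1e-3)      # initial learning rate

# The model's own parameters (the weights stored in model.coefs_ after
# fitting) are learned from the data via model.fit(X, y), not set by us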

In this article, we will conceptually introduce the three most common hyperparameter optimization methods, namely grid search, random search and Bayesian optimization, and then implement them one by one. ​

I will provide a high-level comparison table at the beginning of the article for the reader’s reference, and then further explore, explain, and implement each item in the comparison table throughout the rest of the article. ​

Table 1: Comparison of hyperparameter optimization methods​

1. Grid search algorithm​

Grid search is probably the simplest and most intuitive hyperparameter optimization method: it exhaustively searches for the best combination of hyperparameters in a defined search space. The "search space" in this context is the set of all hyperparameters considered during optimization, together with the values each of them can take. Let's understand grid search better with an example.

Suppose we have a machine learning model with only three parameters, each of which can take the values listed below:

parameter_1 = [1, 2, 3]

parameter_2 = [a, b, c]

parameter_3 = [x, y, z]

We don't know in advance which combination of these parameters will optimize the model's objective function (i.e., produce the best output for our machine learning model). In grid search, we simply try every combination of these parameters, measure the model's performance for each one, and choose the combination that yields the best performance. In this example, parameter 1 can take 3 values (1, 2, or 3), parameter 2 can take 3 values (a, b, or c), and parameter 3 can take 3 values (x, y, or z); in other words, there are 3*3*3 = 27 combinations in total. Grid search in this example therefore involves 27 rounds of evaluating the machine learning model's performance in order to find the best-performing combination.
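
As a quick sanity check of that count, here is a minimal sketch (independent of any model) that enumerates the same search space with itertools.product:

from itertools import product

parameter_1 = [1, 2, 3]
parameter_2 = ['a', 'b', 'c']
parameter_3 = ['x', 'y', 'z']

# Grid search evaluates the full Cartesian product of all values
combinations = list(product(parameter_1, parameter_2, parameter_3))
print(len(combinations))  # prints 27

# A real grid search would train and score the model once per
# combination and keep the best-performing one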

As you can see, this method is very simple (similar to a trial-and-error exercise), but it also has some limitations. Let's summarize its advantages and disadvantages.

The advantages include:

  • Easy to understand and implement
  • Easy to parallelize
  • Suitable for both discrete and continuous spaces

The main disadvantages are:

  • Costly for large and/or complex models with many hyperparameters, because every combination must be tried and evaluated
  • No memory: it does not learn from past observations
  • If the search space is too large, it may fail to find the best combination

My suggestion: if you have a simple model with a small search space, use grid search; otherwise, read on for solutions better suited to larger search spaces.

Now, let's implement grid search using a real example.

1.1. Grid search algorithm implementation

In order to implement grid search, we will use the Iris data set in scikit-learn to create a random forest classification model. This data set contains the petal and sepal measurements of 3 different iris species and will be used for this classification exercise. Model development is secondary in this article, since the goal is to compare the performance of the various hyperparameter optimization strategies. I encourage you to focus on the model evaluation results and on the time each method needs to arrive at its selected set of hyperparameters. I will describe the results of each run and then provide a summary comparison table of the three methods used in this article.

The search space, including all hyperparameter values, is defined as follows:

search_space = {'n_estimators': [10, 100, 500, 1000],
                'max_depth': [2, 10, 25, 50, 100],
                'min_samples_split': [2, 5, 10],
                'min_samples_leaf': [1, 5, 10]}

The above search space consists of 4*5*3*3 = 180 total hyperparameter combinations. We will use grid search to find the combination that optimizes the objective function, as follows:

# Import libraries
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import time

# Load the Iris data set
iris = load_iris()
X, y = iris.data, iris.target

# Define the hyperparameter search space
search_space = {'n_estimators': [10, 100, 500, 1000],
                'max_depth': [2, 10, 25, 50, 100],
                'min_samples_split': [2, 5, 10],
                'min_samples_leaf': [1, 5, 10]}

# Define the random forest classifier
clf = RandomForestClassifier(random_state=1234)

# Create the optimizer object
optimizer = GridSearchCV(clf, search_space, cv=5, scoring='accuracy')

# Store the start time so we can calculate the total elapsed time
start_time = time.time()

# Fit the optimizer on the data
optimizer.fit(X, y)

# Store the end time so we can calculate the total elapsed time
end_time = time.time()

# Print the best set of hyperparameters and the corresponding score
print(f"selected hyperparameters:")
print(optimizer.best_params_)
print("")
print(f"best_score: {optimizer.best_score_}")
print(f"elapsed_time: {round(end_time-start_time, 1)}")

The output of the above code is as follows:

Grid search results

Here we can see the hyperparameter values selected using grid search. best_score describes the evaluation result obtained with the selected set of hyperparameters, and elapsed_time describes the time it took my local laptop to execute this hyperparameter optimization strategy. As you proceed to the next methods, keep the evaluation result and the elapsed time in mind for comparison.
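
If you want to inspect more than just the winning combination, GridSearchCV (like the other search objects used in this article) records every trial in its cv_results_ attribute. A minimal sketch, assuming pandas is available:

import pandas as pd

# One row per tried combination, with its mean cross-validated score
results = pd.DataFrame(optimizer.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())

Now, let's get into the discussion of random search.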

2. Random search algorithm​

As the name suggests, random search randomly samples hyperparameters from a defined search space. Unlike grid search, random search only selects a random subset of hyperparameter combinations for a predefined number of iterations (depending on available resources such as time, budget, goals, etc.), evaluates the machine learning model's performance for each of them, and then chooses the best hyperparameter values.
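
One detail worth noting: unlike grid search, random search does not require a finite list of values per hyperparameter; RandomizedSearchCV also accepts distributions to sample from. A minimal sketch, with illustrative ranges loosely based on the search space above:

from scipy.stats import randint

# Each hyperparameter maps to a distribution instead of a fixed list
distribution_space = {'n_estimators': randint(10, 1001),    # integers in [10, 1000]
                      'max_depth': randint(2, 101),         # integers in [2, 100]
                      'min_samples_split': randint(2, 11),  # integers in [2, 10]
                      'min_samples_leaf': randint(1, 11)}   # integers in [1, 10]

# RandomizedSearchCV would then draw n_iter random points from this space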

Given the above approach, you can imagine that random search is cheaper than a full grid search, but it still has its own advantages and disadvantages:

Advantages:

  • Easy to understand and implement
  • Easy to parallelize
  • Suitable for both discrete and continuous spaces
  • Cheaper than grid search
  • More likely than grid search to converge to the optimum within the same number of attempts

Disadvantages:

  • No memory: it does not learn from past observations
  • Given the random selection, important hyperparameter values may be missed

The next method, Bayesian optimization, addresses the "no memory" shortcoming of grid and random search. But before discussing it, let's implement random search.

2.1. Random search algorithm implementation

Using the following code snippet, we will implement random search hyperparameter optimization for the same problem described in the grid search implementation. ​

# Import libraries
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint  # useful when sampling from distributions, as sketched above

# Create a RandomizedSearchCV object
optimizer = RandomizedSearchCV(clf, param_distributions=search_space,
                               n_iter=50, cv=5, scoring='accuracy',
                               random_state=1234)

# Store the start time so we can calculate the total elapsed time
start_time = time.time()

# Fit the optimizer on the data
optimizer.fit(X, y)

# Store the end time so we can calculate the total elapsed time
end_time = time.time()

# Print the best set of hyperparameters and the corresponding score
print(f"selected hyperparameters:")
print(optimizer.best_params_)
print("")
print(f"best_score: {optimizer.best_score_}")
print(f"elapsed_time: {round(end_time-start_time, 1)}")

The output of the above code is as follows:

Random search results

Compared with the grid search results, these are very interesting. best_score remains the same, but elapsed_time drops from 352.0 seconds to 75.5 seconds! How impressive! In other words, the random search algorithm found a set of hyperparameters that performs identically to that of grid search, in roughly 21% of the time grid search required: a much more efficient outcome.

Next, let's move on to the last method, Bayesian optimization, which learns from every attempt made during the optimization process.

3. Bayesian Optimization

Bayesian optimization is a hyperparameter optimization method that uses a probabilistic model to "learn" from previous attempts and steer the search toward the combination of hyperparameters in the search space that best optimizes the machine learning model's objective function.

The Bayesian optimization method can be divided into 4 steps, which I will describe below. I encourage you to read through these steps to better understand the process, but there is no prerequisite knowledge required to use this method. ​

  1. Define a "prior": a probabilistic model representing our belief, at a given point in time, about which combination of hyperparameters is most likely to optimize the objective function
  2. Evaluate the model on a sample of hyperparameters
  3. Use the knowledge gained in step 2 to update the probabilistic model from step 1 (the "prior") with where we now believe the best combination of hyperparameters lies. This updated belief is called the "posterior". In other words, step 2 helps us understand the search space better and takes us from the prior to the posterior, making the posterior our "latest" knowledge about the search space and the objective function
  4. Repeat steps 2 and 3 until model performance converges, resources are exhausted, or other predefined criteria are met (a minimal sketch of this loop follows these steps)
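
Here is the promised sketch of this loop, using gp_minimize from scikit-optimize (the same library that provides the BayesSearchCV class used later in this article). The one-dimensional "hyperparameter" x and the toy objective function are illustrative assumptions, not part of the original example:

from skopt import gp_minimize

# Toy objective: pretend this returns, e.g., negative model accuracy
# for a single hyperparameter x; gp_minimize minimizes, so lower is better
def objective(params):
    x = params[0]
    return (x - 2.0) ** 2  # true optimum at x = 2.0

result = gp_minimize(objective,                 # step 2: evaluate samples
                     dimensions=[(-5.0, 5.0)],  # the search space for x
                     n_calls=20,                # how often steps 2-3 repeat (step 4)
                     random_state=1234)

print(result.x, result.fun)  # best x found and its objective value

# Internally, gp_minimize maintains a Gaussian-process surrogate model
# (the prior/posterior of steps 1 and 3) to pick the next x to try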

If you are interested in more detailed information about Bayesian optimization, you can read the following post:

"Bayesian Optimization Algorithm in Machine Learning", available at:
https://medium.com/@fmnobar/conceptual-overview-of-bayesian-optimization-for-parameter-tuning-in-machine-learning-a3b1b4b9339f

Now that we understand how Bayesian optimization works, let's look at its advantages and disadvantages.

Advantages:

  • Learns from past observations and is therefore more efficient; in other words, it is expected to find a better set of hyperparameters in fewer iterations than memoryless methods and, under certain assumptions, to converge to the optimum

Disadvantages:

  • Difficult to parallelize
  • More computationally intensive per iteration than grid and random search
  • The choices made in Bayesian optimization (e.g., the prior, i.e., the initial probability distribution, and the acquisition function) can significantly affect its performance and learning curve

With the details out of the way, let's implement Bayesian optimization and see the results.

3.1. Bayesian Optimization Algorithm Implementation

Similar to the previous sections, we will use the following code snippet to implement Bayesian hyperparameter optimization for the same problem described in the grid search implementation.

# Import library
from skopt import BayesSearchCV

# Create a BayesSearchCV object to perform Bayesian optimization
optimizer = BayesSearchCV(estimator=RandomForestClassifier(),
                          search_spaces=search_space,
                          n_iter=10,
                          cv=5,
                          scoring='accuracy',
                          random_state=1234)

# Store the start time so we can calculate the total elapsed time
start_time = time.time()

# Fit the optimizer on the data
optimizer.fit(X, y)

# Store the end time so we can calculate the total elapsed time
end_time = time.time()

# Print the best set of hyperparameters and the corresponding score
print(f"selected hyperparameters:")
print(optimizer.best_params_)
print("")
print(f"best_score: {optimizer.best_score_}")
print(f"elapsed_time: {round(end_time-start_time, 1)}")

The output of the above code is as follows:

Bayesian optimization results

Another set of interesting results! best_score is consistent with what we obtained with grid and random search, but this run took only 23.1 seconds, compared with 75.5 seconds for random search and 352.0 seconds for grid search! In other words, Bayesian optimization took approximately 93% less time than grid search. This is a huge productivity boost that only becomes more meaningful in larger, more complex models and search spaces.

Note that Bayesian optimization only used 10 iterations to obtain these results because it can learn from previous iterations (as opposed to random and grid search). ​

Result Comparison

The table below compares the results of the three methods discussed so far. The "Methodology" column names the hyperparameter optimization method used, followed by the hyperparameters selected with each method. "Best Score" is the score obtained with that method, and "Elapsed Time" is how long the optimization strategy took to run on my local laptop. The last column, "Gained Efficiency", takes grid search as the baseline and calculates the efficiency gained by each of the other two methods relative to it (using elapsed time). For example, since random search took 75.5 seconds and grid search took 352.0 seconds, the efficiency gained by random search relative to the grid search baseline is calculated as 1 - 75.5/352.0 = 78.5%.
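
The same formula, written out as a trivial helper (included only to make the calculation explicit; the function name is mine, not from the original article):

# Gained efficiency relative to the grid search baseline of 352.0 seconds
def gained_efficiency(elapsed_seconds, baseline_seconds=352.0):
    return 1 - elapsed_seconds / baseline_seconds

print(gained_efficiency(75.5))  # ~0.785, i.e. the 78.5% above
print(gained_efficiency(23.1))  # ~0.934, i.e. about 93.4%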

Methodology               Best Score    Elapsed Time (s)    Gained Efficiency
Grid search               (same)        352.0               baseline
Random search             (same)        75.5                78.5%
Bayesian optimization     (same)        23.1                93.4%

Table 2 - Method Performance Comparison Table​

Two main conclusions in the above comparison table:​

  1. Efficiency: We can see how learning methods such as Bayesian optimization can find an optimized set of hyperparameters in less time. ​
  2. Parameter selection: There can be multiple correct answers. For example, the selected parameters for Bayesian optimization are different from those for grid and random search, although the evaluation metric (i.e. best_score) remains the same. This is even more important in larger, more complex environments. ​

Conclusion

In this article, we discussed what hyperparameter optimization is and introduced the three most common methods used for this exercise. We then covered each of the three methods in detail and implemented them in a classification exercise. Finally, we compared the results of the three implementations. We found that methods that learn from previous attempts, such as Bayesian optimization, can significantly improve efficiency, and efficiency can be a determining factor in large, complex models such as deep neural networks.

Translator Introduction

Zhu Xianzhong, 51CTO community editor, 51CTO expert blogger, lecturer, computer teacher at a university in Weifang, and a veteran in the freelance programming industry. ​

Original title: "Hyperparameter Optimization — Intro and Implementation of Grid Search, Random Search and Bayesian Optimization", author: Farzad Mahmoodinobar


