XGBoost hyperparameter tuning with Bayesian optimization using Python
By Simon Löw
XGBoost is one of the leading algorithms in data science right now, giving unparalleled performance on many Kaggle competitions and real-world problems. Unfortunately, XGBoost has a lot of hyperparameters that need to be tuned to achieve optimal performance. In the following, I will show you how you can implement Bayesian optimization in Python to automatically find the best hyperparameters easily and efficiently.
What you will learn:
- What Bayesian optimization is and why it is superior to random or grid search
- How to implement Bayesian optimization in Python
- How you can automatically optimize your XGBoost hyperparameters using Bayesian optimization
What is Bayesian optimization?
In a nutshell, Bayesian optimization trains a machine learning model to predict the best hyperparameters. You can think about your hyperparameter selection problem as a function optimization. For each set of hyperparameters, you get a different model performance and thus a different result under your performance metric. If you knew the shape of the function that maps hyperparameters to model performance, you could easily pick the right parameters. But, unfortunately, the shape is unknown.
Grid search and random search would solve the problem by just blindly searching the whole parameter space (either systematically in a grid or just randomly). But for models with a large parameter space, like XGBoost, they are slow and painfully inefficient. Bayesian optimization, on the other hand, builds a model of the function being optimized and uses it to decide which part of the parameter space to explore next, which is a smarter and usually much faster way to find your parameters.
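To get a feeling for why grid search becomes expensive, here is a small back-of-the-envelope sketch. It assumes a hypothetical grid with five candidate values for each of six hyperparameters (the values themselves are made up for illustration) and counts how many models a 3-fold cross-validated grid search would have to train:

```python
from itertools import product

# Hypothetical grid: five illustrative values per hyperparameter.
grid = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
    "n_estimators": [100, 250, 500, 750, 1000],
    "max_depth": [3, 5, 7, 9, 10],
    "subsample": [0.8, 0.85, 0.9, 0.95, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1.0],
    "gamma": [0, 1, 2, 3, 5],
}

# Every combination of values is one point in the grid.
n_combinations = sum(1 for _ in product(*grid.values()))

# With 3-fold cross-validation, each grid point means three model fits.
cv_folds = 3
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 15625 46875
```

Even this fairly coarse grid already requires tens of thousands of model fits, and adding a sixth value per hyperparameter would nearly triple that number.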
The method we will use here uses Gaussian processes to predict our loss function based on the hyperparameters. Initially, several hyperparameter sets are picked and the loss of the model is calculated. Based on those points the first Gaussian process model is trained.
We now have a machine learning model that predicts the loss for different hyperparameter sets, so we can make a more educated guess about what the optimal hyperparameters might be. Since it’s a Bayesian model, we not only know the expected loss value at each point, we also have an estimate of the uncertainty. Based on this information, we can now pick the hyperparameter set where the Gaussian process model expects the best model performance. We can either accept that hyperparameter set as our answer, or calculate the real loss for it and continue the Bayesian optimization. With each iteration and each new guess, we get more data to train our model and get a more accurate estimate of the loss function.
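The loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it: it assumes scikit-learn's GaussianProcessRegressor as the surrogate model, a made-up one-dimensional objective that stands in for "model score as a function of one hyperparameter", and an upper-confidence-bound rule (mean plus two standard deviations) as one common way to use both the prediction and its uncertainty. The sketch maximizes a score rather than minimizing a loss.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    # Hypothetical stand-in for an expensive model evaluation; best at x = 0.3.
    return -(x - 0.3) ** 2


rng = np.random.default_rng(0)

# Candidate hyperparameter values the surrogate model is asked about.
candidates = np.linspace(0.0, 1.0, 201).reshape(-1, 1)

# Step 1: evaluate the objective at a few randomly picked points.
X_seen = rng.uniform(0.0, 1.0, size=(3, 1))
y_seen = objective(X_seen).ravel()

for _ in range(10):
    # Step 2: fit a Gaussian process to all evaluations collected so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(X_seen, y_seen)

    # Step 3: predicted score and uncertainty for every candidate.
    mean, std = gp.predict(candidates, return_std=True)

    # Step 4: choose the candidate with the highest upper confidence bound.
    ucb = mean + 2.0 * std
    x_next = candidates[np.argmax(ucb)].reshape(1, 1)

    # Step 5: evaluate the real objective there and add it to the data.
    X_seen = np.vstack([X_seen, x_next])
    y_seen = np.append(y_seen, objective(x_next).ravel())

best_x = X_seen[np.argmax(y_seen), 0]
print(best_x)
```

The uncertainty term is what keeps the search from getting stuck: points far from anything evaluated so far have a large standard deviation and therefore still get a chance to be picked.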
Bayesian optimization with Python
Enough theory for now. Let’s jump right into the implementation. Luckily, there is a nice and simple Python library for Bayesian optimization, called bayes_opt.
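To use the library, you just need to implement one simple function that takes your hyperparameters as parameters and returns the value you want to optimize. Note that bayes_opt maximizes this value, so return a score such as AUC directly, or the negative of a loss:
def hyperparam_loss(param_x, param_y):
    # 1. Define machine learning model using param_x, param_y as hyperparameters
    # 2. Train the model
    # 3. Calculate loss on cross-validation set
    # bayes_opt maximizes the return value, so a loss has to be negated
    return -loss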
Hyperparameter tuning for XGBoost
Alright, let’s jump right into our XGBoost optimization problem. For our XGBoost model we want to optimize the following hyperparameters:
- learning_rate: The learning rate of the model. Typical values are 0.01 to 1.0.
- n_estimators: The total number of estimators used. Typical numbers range from 100 to 1000, depending on the dataset size and complexity.
- max_depth: The depth of each estimator tree. Typical values are 3 to 10.
- subsample: The percentage of samples that are used to build each estimator tree. Typical value range is 0.8 to 1.0.
- colsample_bytree: The percentage of columns used to build each tree. The range depends on the number of columns / features in the dataset. Should be 1.0 for datasets with few columns.
- gamma: A regularization parameter. Usually ranging from 0 to 5. If you notice massive overfitting in the data (training set performance much better than test set performance) try bigger values.
As the optimization objective we use ROC AUC, but depending on your problem, different metrics can make sense.
Now that we know the hyperparameters as well as our optimization objective, we can define our optimization problem with just a few lines of Python code. Note that max_depth and n_estimators can only be integer values and thus need to be converted:
import numpy as np
from xgboost import XGBClassifier
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

# Search range (lower, upper) for each hyperparameter
pbounds = {
    'learning_rate': (0.01, 1.0),
    'n_estimators': (100, 1000),
    'max_depth': (3, 10),
    'subsample': (1.0, 1.0),  # Change for big datasets
    'colsample': (1.0, 1.0),  # Change for datasets with lots of features
    'gamma': (0, 5)}

def xgboost_hyper_param(learning_rate,
                        n_estimators,
                        max_depth,
                        subsample,
                        colsample,
                        gamma):
    # The optimizer proposes floats, but these two must be integers
    max_depth = int(max_depth)
    n_estimators = int(n_estimators)

    clf = XGBClassifier(
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        subsample=subsample,
        colsample_bytree=colsample,
        gamma=gamma)

    # X_train and y_train are assumed to be defined already (your training data)
    return np.mean(cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc'))

optimizer = BayesianOptimization(
    f=xgboost_hyper_param,
    pbounds=pbounds,
    random_state=1,
)
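The code above only sets up the optimizer. In bayes_opt, the search itself is started with optimizer.maximize(init_points=..., n_iter=...), and afterwards optimizer.max holds a dictionary with the best "target" and the corresponding "params". Since the optimizer works with floats, the best parameters need the same conversion as inside the objective function. The helper below is a small sketch of that step; to_xgb_kwargs is a hypothetical name, and the numbers in example_max are placeholders in the shape of optimizer.max, not results of a real run.

```python
def to_xgb_kwargs(best_params):
    """Turn the float-valued parameters from the optimizer into XGBClassifier keyword arguments."""
    return {
        "learning_rate": float(best_params["learning_rate"]),
        "n_estimators": int(best_params["n_estimators"]),  # must be an integer
        "max_depth": int(best_params["max_depth"]),        # must be an integer
        "subsample": float(best_params["subsample"]),
        "colsample_bytree": float(best_params["colsample"]),  # rename to the XGBoost argument
        "gamma": float(best_params["gamma"]),
    }


# With the real optimizer you would run something like:
#   optimizer.maximize(init_points=5, n_iter=20)
#   xgb_kwargs = to_xgb_kwargs(optimizer.max["params"])
#   final_model = XGBClassifier(**xgb_kwargs).fit(X_train, y_train)

# Placeholder values in the shape of optimizer.max (not real results).
example_max = {
    "target": 0.9,
    "params": {
        "learning_rate": 0.12,
        "n_estimators": 412.7,
        "max_depth": 5.9,
        "subsample": 1.0,
        "colsample": 1.0,
        "gamma": 1.3,
    },
}

xgb_kwargs = to_xgb_kwargs(example_max["params"])
print(xgb_kwargs)
```

Here init_points is the number of random evaluations before the Gaussian process takes over and n_iter is the number of guided steps; the values 5 and 20 in the comment are only a starting point, not a recommendation from the library.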
You might wonder about the cross_val_score call in the return statement of the code above. When optimizing your hyperparameters, it’s tempting to just use the training or test set to evaluate the model performance. But the problem is that we would then overfit on either the training or the test set. Thus it’s better to use cross-validation and evaluate the model performance on data the model was not trained on. For the example here I used 3-fold cross-validation, so we get 3 different AUC values for 3 different splits of the data, which are then averaged with np.mean.
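If you want to see what cross_val_score returns on its own, the following sketch runs it on synthetic data. To keep it quick and free of extra dependencies it assumes a scikit-learn LogisticRegression as a stand-in for the XGBoost classifier and a dataset generated with make_classification; neither is part of the tuning code above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Stand-in classifier; any scikit-learn compatible model works the same way.
clf = LogisticRegression(max_iter=1000)

# One ROC AUC value per fold: train on two thirds, score on the remaining third.
scores = cross_val_score(clf, X, y, cv=3, scoring="roc_auc")

# The single number handed back to the optimizer.
mean_auc = float(np.mean(scores))
print(scores, mean_auc)
```

Averaging the fold scores gives a less noisy estimate than a single train/validation split, at the cost of training the model three times per hyperparameter set.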
Summary
You now know how to implement Bayesian optimization for an XGBoost model using Python. For simple cases, you could just use the Python code from above and replace the data set with your own. For more complex cases, you might want to dig a bit deeper and explore the details of Bayesian optimization (in fact, Bayesian optimization itself has some hyperparameters that can be tuned). Note also that Bayesian optimization can be applied to all kinds of optimization problems and different machine learning algorithms, so make sure you play around with the library a bit.