Recently I have been running some experiments comparing Python XGBoost and LightGBM. LightGBM is a relatively new algorithm, and it is said to beat XGBoost in both speed and accuracy.
This is the LightGBM GitHub repository. These are the LightGBM Python API docs, where you will find the Python functions you can call. The model can be used either through the native LightGBM interface or through the LightGBM scikit-learn wrapper.
This is the XGBoost Python API I use. As you can see, it has a data structure very similar to the LightGBM Python API above.
Here is what I tried:
- If you use the `train()` method in both XGBoost and LightGBM, then yes, LightGBM is faster and more accurate. But this method has no cross-validation.
- If you use the `cv()` method in both algorithms for cross-validation, I did not find a way to make it return a set of optimal parameters.
- If you use scikit-learn's `GridSearchCV()` with `XGBClassifier` and `LGBMClassifier`, it works for `XGBClassifier`, but for `LGBMClassifier` it runs forever.
Here are my code examples when using GridSearchCV() with both classifiers:
XGBClassifier with GridSearchCV
```python
param_set = {'n_estimators': [50, 100, 500, 1000]}
gsearch = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=100,
        max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        nthread=7,
        objective='binary:logistic',
        scale_pos_weight=1,
        seed=410),
    param_grid=param_set,
    scoring='roc_auc',
    n_jobs=7,
    iid=False,
    cv=10)
xgb_model2 = gsearch.fit(features_train, label_train)
xgb_model2.grid_scores_, xgb_model2.best_params_, xgb_model2.best_score_
```
This works very well for XGBoost and finishes in only a few seconds.
LightGBM with GridSearchCV
```python
param_set = {'n_estimators': [20, 50]}
gsearch = GridSearchCV(
    estimator=LGBMClassifier(
        boosting_type='gbdt',
        num_leaves=30,
        max_depth=5,
        learning_rate=0.1,
        n_estimators=50,
        max_bin=225,
        subsample_for_bin=0.8,  # note: per the docs this expects an integer sample count, not a fraction
        objective=None,
        min_split_gain=0,
        min_child_weight=5,
        min_child_samples=10,
        subsample=1,
        subsample_freq=1,
        colsample_bytree=1,
        reg_alpha=1,
        reg_lambda=0,
        seed=410,
        nthread=7,
        silent=True),
    param_grid=param_set,
    scoring='roc_auc',
    n_jobs=7,
    iid=False,
    cv=10)
lgb_model2 = gsearch.fit(features_train, label_train)
lgb_model2.grid_scores_, lgb_model2.best_params_, lgb_model2.best_score_
```
However, for LightGBM this has been running all morning and still has not produced anything. I use the same dataset for both classifiers; it contains 30,000 records.
I have 2 questions:
- If we just use the `cv()` method, is there still a way to get the optimal set of parameters from it?
- Do you know why `GridSearchCV()` does not work with LightGBM? Does this only happen to me, or has it happened to others?