Python - LightGBM with GridSearchCV runs forever

Recently, I have been running some experiments to compare Python XGBoost and LightGBM. LightGBM is a fairly new algorithm, and people say it beats XGBoost in both speed and accuracy.

Here is the LightGBM GitHub repository. Here are the LightGBM Python API docs, where you can find the Python functions you can call. LightGBM can be used either through its native training API or through its scikit-learn wrapper.

This is the XGBoost Python API I use. As you can see, it has a very similar structure to the LightGBM Python API above.

Here is what I tried:

  • If you use the train() method in both XGBoost and LightGBM, then yes, LightGBM is faster and more accurate. But this method has no cross-validation.
  • Both libraries also provide a cv() method for cross-validation. However, I did not find a way to make it return an optimal set of parameters (see the cv() sketch after this list).
  • If you try scikit-learn's GridSearchCV() with LGBMClassifier and XGBClassifier: it works for XGBClassifier, but for LGBMClassifier it runs forever.
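
To make the cv() point concrete, here is roughly what I tried. This is only a sketch: the parameter values are illustrative, randomly generated stand-in data replaces my real features_train / label_train, and in recent LightGBM versions early stopping is passed as a callback rather than via early_stopping_rounds. It shows that cv() only returns per-round metric histories, so the only thing I can really tune with it is the number of boosting rounds:

import lightgbm as lgb
from sklearn.datasets import make_classification

# stand-in for my real training data (features_train / label_train, ~30,000 records)
features_train, label_train = make_classification(n_samples=30000, random_state=410)

# illustrative parameter values, not a tuned configuration
params = {'objective': 'binary', 'metric': 'auc',
          'learning_rate': 0.1, 'num_leaves': 30, 'max_depth': 5}
train_set = lgb.Dataset(features_train, label=label_train)

cv_results = lgb.cv(params, train_set, num_boost_round=1000, nfold=10,
                    stratified=True, early_stopping_rounds=50, seed=410)

# cv() returns a dict of per-round scores, e.g. cv_results['auc-mean']
# (key names vary slightly between versions); the best round count is just
# the length of that list, but there is no best_params_-style output.
auc_mean = cv_results['auc-mean']
print('best num_boost_round:', len(auc_mean), 'cv auc:', auc_mean[-1])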

Here are my code examples when using GridSearchCV() with both classifiers:

XGBClassifier with GridSearchCV

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

param_set = {'n_estimators': [50, 100, 500, 1000]}
gsearch = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=5,
                            min_child_weight=1, gamma=0, subsample=0.8,
                            colsample_bytree=0.8, nthread=7,
                            objective='binary:logistic', scale_pos_weight=1,
                            seed=410),
    param_grid=param_set, scoring='roc_auc', n_jobs=7, iid=False, cv=10)
xgb_model2 = gsearch.fit(features_train, label_train)
xgb_model2.grid_scores_, xgb_model2.best_params_, xgb_model2.best_score_

This works very well for XGBoost and takes only a few seconds.

LightGBM with GridSearchCV

from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

param_set = {'n_estimators': [20, 50]}
gsearch = GridSearchCV(
    estimator=LGBMClassifier(boosting_type='gbdt', num_leaves=30, max_depth=5,
                             learning_rate=0.1, n_estimators=50, max_bin=225,
                             subsample_for_bin=0.8, objective=None,
                             min_split_gain=0, min_child_weight=5,
                             min_child_samples=10, subsample=1, subsample_freq=1,
                             colsample_bytree=1, reg_alpha=1, reg_lambda=0,
                             seed=410, nthread=7, silent=True),
    param_grid=param_set, scoring='roc_auc', n_jobs=7, iid=False, cv=10)
lgb_model2 = gsearch.fit(features_train, label_train)
lgb_model2.grid_scores_, lgb_model2.best_params_, lgb_model2.best_score_

However, with LightGBM this method has been running all morning and still has not produced anything.

I use the same dataset for both; it contains 30,000 records.

I have 2 questions:

  • If we just use the cv() method, is there still a way to get the optimal set of parameters out of it?
  • Do you know why GridSearchCV() does not work with LightGBM? I wonder whether this happens only to me or whether others have seen it too.
1 answer

Try using n_jobs = 1 and see if it works.
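
For example, here is a minimal sketch of your LightGBM search with only that one change. I kept just a few of your estimator parameters for brevity, and the fit call is commented out because it needs your features_train / label_train arrays:

from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

param_set = {'n_estimators': [20, 50]}
gsearch = GridSearchCV(
    estimator=LGBMClassifier(num_leaves=30, max_depth=5,
                             learning_rate=0.1, random_state=410),
    param_grid=param_set,
    scoring='roc_auc',
    n_jobs=1,   # run the grid search itself in a single process
    cv=10)
# gsearch.fit(features_train, label_train)   # your training arrays from the question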

In general, if you use n_jobs=-1 or n_jobs > 1, then you should protect the entry point of your script with if __name__=='__main__':

A simple example:

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

if __name__ == '__main__':
    data = pd.read_csv('Prior Decompo2.csv', header=None)
    X, y = data.iloc[0:, 0:26].values, data.iloc[0:, 26].values
    param_grid = {'C': [0.01, 0.1, 1, 10], 'kernel': ('rbf', 'linear')}
    classifier = SVC()
    grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid,
                               scoring='accuracy', n_jobs=-1, verbose=42)
    grid_search.fit(X, y)

Finally, can you try running your code with n_jobs=-1, adding the if __name__=='__main__': guard as I explained, and see if it works?
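
Here is a rough sketch of what I mean, adapting the grid search from your question: the guard wraps everything, generated stand-in data replaces your real arrays, and only a subset of your parameters is kept:

from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

if __name__ == '__main__':
    # stand-in data; replace with your real features_train / label_train
    features_train, label_train = make_classification(n_samples=30000,
                                                      random_state=410)

    param_set = {'n_estimators': [20, 50]}
    gsearch = GridSearchCV(
        estimator=LGBMClassifier(num_leaves=30, max_depth=5,
                                 learning_rate=0.1, random_state=410),
        param_grid=param_set,
        scoring='roc_auc',
        n_jobs=-1,   # parallel search, run safely behind the guard
        cv=10,
        verbose=2)
    lgb_model = gsearch.fit(features_train, label_train)
    print(lgb_model.best_params_, lgb_model.best_score_)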


Source: https://habr.com/ru/post/1269736/

