Understanding python xgboost cv

I would like to use the xgboost cv function to find the best parameters for my training dataset. I am confused by the API. How do I find the best parameter values? Is this like sklearn's grid_search cross-validation? How can I determine which of the values for the parameter max_depth ([2,4,6]) was found to be optimal?

    from sklearn.datasets import load_iris
    import xgboost as xgb

    iris = load_iris()
    DTrain = xgb.DMatrix(iris.data, iris.target)
    x_parameters = {"max_depth": [2, 4, 6]}
    xgb.cv(x_parameters, DTrain)

    Out[6]:
       test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
    0        0.888435       0.059403         0.888052        0.022942
    1        0.854170       0.053118         0.851958        0.017982
    2        0.837200       0.046986         0.833532        0.015613
    3        0.829001       0.041960         0.824270        0.014501
    4        0.825132       0.038176         0.819654        0.013975
    5        0.823357       0.035454         0.817363        0.013722
    6        0.822580       0.033540         0.816229        0.013598
    7        0.822265       0.032209         0.815667        0.013538
    8        0.822158       0.031287         0.815390        0.013508
    9        0.822140       0.030647         0.815252        0.013494
4 answers

Cross-validation is used to evaluate the performance of one set of parameters on unseen data.

Grid-search evaluates a model with various parameters to find the best combination of them.

The sklearn docs talk a lot about CV, and the two can be used in combination, but each one has a distinct purpose.

You might be able to plug xgboost into sklearn's grid-search functionality. Check out the sklearn wrapper in xgboost for the smoothest application.


Sklearn's GridSearchCV should be the way to go if you are looking for the best parameter settings. You just need to pass the xgb classifier to GridSearchCV and evaluate using the best CV score.

Here is a good tutorial to help you get started with parameter tuning: http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/


You can use GridSearchCV with xgboost via the xgboost sklearn API.

Define your classifier as follows:

    from xgboost.sklearn import XGBClassifier
    from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

    # other_params: a dict of fixed hyperparameters; train/target: your data
    xgb_model = XGBClassifier(**other_params)
    test_params = {'max_depth': [4, 8, 12]}

    model = GridSearchCV(estimator=xgb_model, param_grid=test_params)
    model.fit(train, target)
    print(model.best_params_)

I would go with Hyperopt:

https://github.com/hyperopt/hyperopt

It is open source and has worked great for me. If you decide to use it and need help, I can elaborate.

When you want to evaluate "max_depth": [2,4,6] , you can naively solve this by training 3 models, one for each max depth value, and seeing which model gives the best results.

But "max_depth" is not the only hyperparameter you should consider. There are many others, such as eta (learning rate), gamma, min_child_weight, subsample , etc. Some of them are continuous and some are discrete. (This assumes you already know your objective function and evaluation metric.)

You can read about all of them here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

When you consider all these parameters together, the search space they create is enormous. You cannot search it manually (and an "expert" cannot simply hand you the best values either).

Hyperopt gives you a neat solution to this: it builds a search space that is neither purely random nor a grid. All you have to do is define the parameters and their ranges.

Here you can find sample code: https://github.com/bamine/Kaggle-stuff/blob/master/otto/hyperopt_xgboost.py

I can tell you from my own experience that it worked better than Bayesian optimization on my models. Give it a few hours or days of trial and error, and contact me if you run into problems you cannot solve.

Good luck


Source: https://habr.com/ru/post/1239237/
