How to evaluate the progress of GridSearchCV from the detailed output in Scikit-Learn?

Question

I am currently running a fairly aggressive grid search. I have n=135 samples and I am running 23 folds using a custom cross-validation train/test list. I have verbose=2.

Below I run:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    param_test = {
        "loss": ["deviance"],
        "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
        "min_samples_split": np.linspace(0.1, 0.5, 12),
        "min_samples_leaf": np.linspace(0.1, 0.5, 12),
        "max_depth": [3, 5, 8],
        "max_features": ["log2", "sqrt"],
        "min_impurity_split": [5e-6, 1e-7, 5e-7],
        "criterion": ["friedman_mse", "mae"],
        "subsample": [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
        "n_estimators": [10],
    }
    # cv_indices is the custom list of 23 train/test splits mentioned above
    Mod_gsearch = GridSearchCV(
        estimator=GradientBoostingClassifier(),
        param_grid=param_test,
        scoring="accuracy",
        n_jobs=32,
        iid=False,
        cv=cv_indices,
        verbose=2,
    )

I looked at the verbose output in stdout:

    $ head gridsearch.o8475533
    Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits

Based on this, it looks like there are 5842368 cross-validation fits (254016 candidates × 23 folds) for my grid parameters.

    $ grep -c "[CV]" gridsearch.o8475533
    7047332

It looks like about 7 million cross-validations have been done so far, but that is more than the 5842368 total fits...

 7047332/5842368 = 1.2062458236

Then, when I look at the stderr file:

    $ cat ./gridsearch.e8475533
    [Parallel(n_jobs=32)]: Done 132 tasks | elapsed: 1.2s
    [Parallel(n_jobs=32)]: Done 538 tasks | elapsed: 2.8s
    [Parallel(n_jobs=32)]: Done 1104 tasks | elapsed: 4.8s
    [Parallel(n_jobs=32)]: Done 1834 tasks | elapsed: 7.9s
    [Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s
    ...
    [Parallel(n_jobs=32)]: Done 3396203 tasks | elapsed: 250.2min
    [Parallel(n_jobs=32)]: Done 3420769 tasks | elapsed: 276.5min
    [Parallel(n_jobs=32)]: Done 3447309 tasks | elapsed: 279.3min
    [Parallel(n_jobs=32)]: Done 3484240 tasks | elapsed: 282.3min
    [Parallel(n_jobs=32)]: Done 3523550 tasks | elapsed: 285.3min

My goal:

How can I tell how far along my grid search is relative to the total time it may take?

What I am confused about:

What is the relationship between the [CV] lines in stdout, the total number of fits in stdout, and the tasks in stderr?

python scikit-learn parameters machine-learning grid-search

O.rka Apr 13 '17 at 18:02

1 answer

vladkha · Accepted Answer · 2017-05-20T23:48:06+0000

The math is simple, but at first glance a little misleading:

When each task starts, the logging mechanism prints a "[CV] ..." line to stdout marking the start of execution, and after the task ends it prints another "[CV] ..." line that also includes the time spent on that specific task (at the end of the line).
In addition, at certain intervals, the logging mechanism writes a progress line to stderr (or to stdout if verbose is set to > 50), indicating the number of completed tasks (fits) and the total time elapsed so far, like this:
    [Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s
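
To read these progress lines programmatically, something like the following sketch may help. The helper name `parse_progress_line` and the regular expression are my own illustration, not part of scikit-learn or joblib, and the pattern assumes the elapsed time is printed with an `s` or `min` suffix, as in the log shown in the question.

```python
import re

# Matches lines such as:
#   [Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s
# Assumption: the elapsed time ends in "s" or "min", as in the log above.
PROGRESS_RE = re.compile(
    r"Done\s+(\d+)\s+tasks?\s*\|\s*elapsed:\s*([\d.]+)\s*(s|min)"
)


def parse_progress_line(line):
    """Return (completed_tasks, elapsed_seconds), or None if the line does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    done = int(match.group(1))
    value = float(match.group(2))
    # Convert minutes to seconds so all results share one unit.
    seconds = value * 60.0 if match.group(3) == "min" else value
    return done, seconds


print(parse_progress_line("[Parallel(n_jobs=32)]: Done 2724 tasks | elapsed: 11.6s"))
```

Applying this to the last matching line of the stderr file gives the most recent count of completed tasks and the elapsed time.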

In your case, you have a total of 5842368 fits, i.e. tasks.
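
As a quick sanity check, the total can be reproduced from the grid in the question: multiply the number of values given for each parameter to get the number of candidates, then multiply by the 23 folds. The dictionary below only records those counts; the variable names are illustrative.

```python
import math

# Number of values listed for each parameter in the question's param_test.
grid_sizes = {
    "loss": 1,
    "learning_rate": 7,
    "min_samples_split": 12,
    "min_samples_leaf": 12,
    "max_depth": 3,
    "max_features": 2,
    "min_impurity_split": 3,
    "criterion": 2,
    "subsample": 7,
    "n_estimators": 1,
}
n_folds = 23

# Every combination of parameter values is one candidate.
n_candidates = math.prod(grid_sizes.values())
# Each candidate is fitted once per fold.
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)  # 254016 5842368
```

This matches the "Fitting 23 folds for each of 254016 candidates, totalling 5842368 fits" line in stdout.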

You counted 7047332 "[CV] ..." lines, which corresponds to about 7047332/2 = 3523666 completed tasks, since each task prints one line when it starts and one when it ends. The progress line shows how many tasks have actually completed: 3523550. The two numbers differ slightly because some tasks had started but not yet finished at the time of counting.
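
Putting the numbers together, a rough progress and remaining-time estimate could look like the sketch below. The function `estimate_progress` is a hypothetical helper, and the estimate assumes the remaining fits run at the same average rate as the completed ones, which the log suggests is only approximately true.

```python
def estimate_progress(done, total, elapsed_minutes):
    """Return (fraction_done, estimated_minutes_remaining).

    Assumption: the remaining tasks take, on average, as long as the
    completed ones. Real fit times vary with the parameters.
    """
    fraction = done / total
    remaining = elapsed_minutes * (total - done) / done
    return fraction, remaining


cv_lines = 7047332     # "[CV]" lines counted in stdout
total_fits = 5842368   # from the "Fitting 23 folds ..." line
done = 3523550         # last "Done ... tasks" value in stderr
elapsed = 285.3        # minutes, from the same stderr line

# Each task prints two "[CV]" lines (start and end), so halve the count.
approx_done_from_stdout = cv_lines // 2

fraction, remaining = estimate_progress(done, total_fits, elapsed)
print(f"{fraction:.1%} done, roughly {remaining:.0f} min remaining")
print(f"stdout-based count: {approx_done_from_stdout}, stderr count: {done}")
```

With these figures the search is a bit past the 60% mark, so a little over three more hours would be needed if the pace holds.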
