GridSearchCV is extremely slow on a small dataset in scikit-learn

This is odd. I can successfully run the grid_search_digits.py example. However, I cannot perform a grid search on my own data.

I have the following setup:

 import sklearn
 from sklearn.svm import SVC
 from sklearn.grid_search import GridSearchCV
 from sklearn.cross_validation import LeaveOneOut
 from sklearn.metrics import auc_score

 # ... Build X and y ....

 tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                      'C': [1, 10, 100, 1000]},
                     {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
 loo = LeaveOneOut(len(y))
 clf = GridSearchCV(SVC(C=1), tuned_parameters, score_func=auc_score)
 clf.fit(X, y, cv=loo)
 ....
 print clf.best_estimator_
 ....

But clf.fit never finishes (I left it running for ~1 hour).

I tried also with

 clf.fit(X, y, cv=10) 

and

 skf = StratifiedKFold(y, 2)
 clf.fit(X, y, cv=skf)

and had the same problem (the clf.fit statement never completes). My data is simple:

 > X.shape
 (27, 26)
 > y.shape
 27
 > numpy.sum(y)
 5
 > y.dtype
 dtype('int64')
 > y
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1]
 > X
 [[ -3.61238468e+03  -3.61253920e+03  -3.61290196e+03  -3.61326679e+03
     7.84590361e+02   0.0000 <...> 0000e+00   2.22389150e+00
     2.53252959e+00   2.11606216e+00  -1.99613432e+05  -1.99564828e+05]]

This is all with the latest version of scikit-learn (0.13.1) and:

 $ pip freeze
 Cython==0.19.1
 PIL==1.1.7
 PyXB==1.2.2
 PyYAML==3.10
 argparse==1.2.1
 distribute==0.6.34
 epc==0.0.5
 ipython==0.13.2
 jedi==0.6.0
 matplotlib==1.3.x
 nltk==2.0.4
 nose==1.3.0
 numexpr==2.1
 numpy==1.7.1
 pandas==0.11.0
 pyparsing==1.5.7
 python-dateutil==2.1
 pytz==2013b
 rpy2==2.3.1
 scikit-learn==0.13.1
 scipy==0.12.0
 sexpdata==0.0.3
 six==1.3.0
 stemming==1.0.1
 -e git+https://github.com/PyTables/PyTables.git@df7b20444b0737cf34686b5d88b4e674ec85575b#egg=tables-dev
 tornado==3.0.1
 wsgiref==0.1.2

It is odd because fitting a single SVM is extremely fast:

 > %timeit clf2 = svm.SVC(); clf2.fit(X, y)
 1000 loops, best of 3: 328 us per loop

Update

I noticed that if I pre-scale the data with

 from sklearn import preprocessing
 X = preprocessing.scale(X)

the grid search is very fast.

Why? Why is GridSearchCV so sensitive to scaling when a plain svm.SVC().fit is not?

+6
3 answers

As already noted, for SVM-based classifiers (here y == np.int64) preprocessing is mandatory; otherwise the estimator's predictive ability is degraded, because badly scaled features distort the decision function.
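A minimal sketch of mandatory scaling in practice, using the modern scikit-learn API (module paths differ from the 0.13.1 versions in the question); putting the scaler inside a Pipeline re-fits it on each CV training fold, so the validation fold never leaks into the scaling statistics. The data below is synthetic, shaped like the question's X and y.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(27, 26) * 1e5                 # wildly scaled features, like the question's X
y = np.array([0] * 22 + [1] * 5)            # 27 samples, 5 positives

# Scaler + SVC chained: scaling is learned only on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__kernel": ["rbf"],
              "svc__gamma": [1e-3, 1e-4],
              "svc__C": [1, 10, 100, 1000]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)                            # completes quickly on scaled data
print(search.best_params_)
```

With the scaler in the pipeline there is no need to call preprocessing.scale on X up front, and predictions on new data are scaled automatically.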

Regarding the long processing time:

  • try to better understand what the Overfit / Generalization landscape of your model over [C, gamma] looks like
  • try adding verbosity to the initial GridSearchCV setup
  • try adding n_jobs to put more cores on the number-crunching
  • try adding grid computing to your approach if the scale of the problem requires it


 aGrid = aML_GS.GridSearchCV( aClassifierOBJECT,
                              param_grid = aGrid_of_parameters,
                              cv         = cv,
                              n_jobs     = n_JobsOnMultiCpuCores,
                              verbose    = 5
                              )

Sometimes GridSearchCV() really can consume a huge amount of CPU time and CPU resources, even after all of the above tips are applied.

So stay calm and do not panic, provided you are sure that the feature engineering, data sanity checks, and feature-domain preprocessing were done correctly.

 [GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.761619 - 62.7min
 [GridSearchCV] C=16777216.0, gamma=0.5 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.792793 - 64.4min
 [GridSearchCV] C=16777216.0, gamma=1.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.793103 -116.4min
 [GridSearchCV] C=16777216.0, gamma=1.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.794603 -205.4min
 [GridSearchCV] C=16777216.0, gamma=1.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.771772 -200.9min
 [GridSearchCV] C=16777216.0, gamma=2.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.713643 -446.0min
 [GridSearchCV] C=16777216.0, gamma=2.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.743628 -184.6min
 [GridSearchCV] C=16777216.0, gamma=2.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.761261 -281.2min
 [GridSearchCV] C=16777216.0, gamma=4.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=4.0, score=0.670165 -138.7min
 [GridSearchCV] C=16777216.0, gamma=4.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.760120 - 97.3min
 [GridSearchCV] C=16777216.0, gamma=4.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.732733 - 66.3min
 [GridSearchCV] C=16777216.0, gamma=8.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.755622 - 13.6min
 [GridSearchCV] C=16777216.0, gamma=8.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.772114 -  4.6min
 [GridSearchCV] C=16777216.0, gamma=8.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.717718 - 14.7min
 [GridSearchCV] C=16777216.0, gamma=16.0 ........................................
 [GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.763118 - 1.3min
 [GridSearchCV] C=16777216.0, gamma=16.0 ........................................
 [GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.746627 -  25.4s
 [GridSearchCV] C=16777216.0, gamma=16.0 ........................................
 [GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.738739 -  44.9s
 [Parallel(n_jobs=1)]: Done 2700 out of 2700 | elapsed: 5670.8min finished

As already mentioned regarding the "... regular svm.SVC().fit ": kindly notice that it uses the default [C, gamma] values, and is therefore not representative of your model's behavior on your problem domain.

Re: Update

Yes, indeed, regularizing / scaling the SVM inputs is a must for this AI / ML tool. scikit-learn provides good machinery for creating and reusing a Scaler object: fit it once for a-priori scaling (before the dataset goes into .fit()), then reuse it for ex-post, ad-hoc scaling whenever a new example has to be rescaled before being sent to the predictor, as in anSvmCLASSIFIER.predict( aScalerOBJECT.transform( aNewExampleX ) ).

(Yes, aNewExampleX may be a matrix, so you can ask for "vectorized" processing of several predictions at once.)
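The scaler-reuse pattern described above can be sketched as follows, using StandardScaler from the modern scikit-learn API (variable names and the synthetic data here are illustrative, not part of scikit-learn):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(42)
X_train = rng.randn(50, 4) * 100.0               # badly scaled training features
y_train = (X_train[:, 0] > 0).astype(int)

# a-priori scaling: fit the scaler ONCE on the training data
scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X_train), y_train)

# ex-post scaling: new examples must pass through the SAME fitted scaler
X_new = rng.randn(3, 4) * 100.0                  # a matrix -> "vectorized" predict
predictions = clf.predict(scaler.transform(X_new))
print(predictions.shape)                         # one label per row of X_new
```

Re-fitting the scaler on the new examples instead of reusing the training-time one would silently shift the feature space and invalidate the predictions.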

Performance relief: the O( M^2 . N^1 ) computational complexity

Contrary to the intuition that the "width" of the problem, measured as N, the number of SVM features in matrix X, should be charged for the total computation time, an SVM classifier with the rbf kernel is by design an O( M^2 . N^1 ) problem.

Thus, there is a quadratic dependence on the total number of observations (examples) moved into the training phase ( .fit() ) or cross-validation, and it can hardly be argued that a supervised classifier will gain any better predictive power by "reducing" the (only linear) "width" of the feature set, which itself carries the inputs into the constructed predictive power of the SVM classifier, right?
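A rough illustration of the O( M^2 . N^1 ) claim: the RBF kernel (Gram) matrix the solver works over has M x M entries, each costing O(N) to evaluate, so doubling the number of samples M quadruples the kernel matrix size while doubling N only doubles the per-entry cost. A small sketch:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

N = 26                                    # number of features (the "width")
for M in (100, 200, 400):                 # number of training samples
    X = np.random.randn(M, N)
    K = rbf_kernel(X, gamma=0.1)          # M x M Gram matrix
    print(M, K.shape, K.size)             # entry count grows as M^2
```

Going from M=100 to M=400 multiplies the Gram matrix size by 16, which is why the number of training examples, not the feature count, dominates SVM fit time.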

+7

Support Vector Machines are sensitive to scaling. Most likely, your individual SVC simply takes a long time to build a single model. GridSearch is basically a brute-force method that fits base models with various parameter combinations. So, if your GridSearchCV is taking a long time to run, it is most likely due to

  • a large number of parameter combinations (which is not the case here), or
  • your individual model taking a long time to fit.
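The two factors above combine into a back-of-the-envelope estimate: total grid-search cost is roughly (number of parameter candidates) x (CV folds) x (single fit time). A sketch using the question's own grid and the ~328 us single-fit timing from the %timeit above (ParameterGrid is from the modern scikit-learn API):

```python
from sklearn.model_selection import ParameterGrid

# the question's grid: 2 gamma x 4 C for rbf, plus 4 C for linear
grid = [{"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
        {"kernel": ["linear"], "C": [1, 10, 100, 1000]}]
n_candidates = len(ParameterGrid(grid))     # 8 + 4 = 12 candidates
n_folds = 27                                # LeaveOneOut on 27 samples
single_fit_seconds = 328e-6                 # ~328 us per fit, from %timeit

total = n_candidates * n_folds * single_fit_seconds
print(n_candidates, round(total, 3), "seconds")
```

At well under a second for the whole search, this estimate confirms that the hour-long runtime came from individual fits failing to converge on unscaled data, not from the size of the grid.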
+3

I just want to point out one thing that can speed up any GridSearchCV, if you are not already using it:

GridSearchCV has an n_jobs parameter that runs the search on several cores of your processor, which speeds up the process. For instance:

 GridSearchCV(clf, verbose=1, param_grid=tuned_parameters, n_jobs=-1) 

Specifying -1 will use all available CPU cores. If you have four cores with hyperthreading, this spins up 8 concurrent workers instead of the 1 you get without this option.

0

Source: https://habr.com/ru/post/1489564/

