GridSearchCV is extremely slow on a small dataset in scikit-learn

This is odd. I can successfully run the grid_search_digits.py example. However, I cannot perform a grid search on my own data.

I have the following setup:

 import sklearn
 from sklearn.svm import SVC
 from sklearn.grid_search import GridSearchCV
 from sklearn.cross_validation import LeaveOneOut
 from sklearn.metrics import auc_score

 # ... Build X and y ....

 tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                      'C': [1, 10, 100, 1000]},
                     {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
 loo = LeaveOneOut(len(y))
 clf = GridSearchCV(SVC(C=1), tuned_parameters, score_func=auc_score)
 clf.fit(X, y, cv=loo)
 ....
 print clf.best_estimator_
 ....

But clf.fit never finishes (I left it running for ~1 hour).

I tried also with

 clf.fit(X, y, cv=10) 

and

 skf = StratifiedKFold(y, 2)
 clf.fit(X, y, cv=skf)

and had the same problem (the clf.fit statement never completes). My data is simple:

 > X.shape
 (27, 26)
 > y.shape
 27
 > numpy.sum(y)
 5
 > y.dtype
 dtype('int64')
 > y
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1]
 > X
 [[ -3.61238468e+03  -3.61253920e+03  -3.61290196e+03  -3.61326679e+03
     7.84590361e+02   0.0000 <...> 0000e+00   2.22389150e+00
     2.53252959e+00   2.11606216e+00  -1.99613432e+05  -1.99564828e+05]]

This is all with the latest version of scikit-learn (0.13.1) and:

 $ pip freeze
 Cython==0.19.1
 PIL==1.1.7
 PyXB==1.2.2
 PyYAML==3.10
 argparse==1.2.1
 distribute==0.6.34
 epc==0.0.5
 ipython==0.13.2
 jedi==0.6.0
 matplotlib==1.3.x
 nltk==2.0.4
 nose==1.3.0
 numexpr==2.1
 numpy==1.7.1
 pandas==0.11.0
 pyparsing==1.5.7
 python-dateutil==2.1
 pytz==2013b
 rpy2==2.3.1
 scikit-learn==0.13.1
 scipy==0.12.0
 sexpdata==0.0.3
 six==1.3.0
 stemming==1.0.1
 -e git+https://github.com/PyTables/PyTables.git@df7b20444b0737cf34686b5d88b4e674ec85575b#egg=tables-dev
 tornado==3.0.1
 wsgiref==0.1.2

It is odd because fitting a single SVM is extremely fast:

 > %timeit clf2 = svm.SVC(); clf2.fit(X, y)
 1000 loops, best of 3: 328 us per loop

Update

I noticed that if I pre-scale the data with

 from sklearn import preprocessing
 X = preprocessing.scale(X)

the grid search is very fast.

Why? Why is GridSearchCV so sensitive to scaling when a plain svm.SVC().fit is not?

+6
3 answers

As already noted, for SVM-based classifiers (here y == np.int64) preprocessing is mandatory; otherwise the estimator's predictive ability is degraded, because badly scaled features distort the decision function.
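A minimal sketch of mandatory scaling in practice, using the modern scikit-learn API (module paths differ from the 0.13.1 versions in the question); putting the scaler inside a Pipeline re-fits it on each CV training fold, so the validation fold never leaks into the scaling statistics. The data below is synthetic, shaped like the question's X and y.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(27, 26) * 1e5                 # wildly scaled features, like the question's X
y = np.array([0] * 22 + [1] * 5)            # 27 samples, 5 positives

# Scaler + SVC chained: scaling is learned only on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__kernel": ["rbf"],
              "svc__gamma": [1e-3, 1e-4],
              "svc__C": [1, 10, 100, 1000]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)                            # completes quickly on scaled data
print(search.best_params_)
```

With the scaler in the pipeline there is no need to call preprocessing.scale on X up front, and predictions on new data are scaled automatically.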

Regarding the long processing time:

  • try to better understand what the Overfit / Generalization landscape of your model over [C, gamma] looks like
  • try adding verbosity to the initial GridSearchCV setup
  • try adding n_jobs to put more cores on the number-crunching
  • try adding grid computing to your approach if the scale of the problem requires it


 aGrid = aML_GS.GridSearchCV( aClassifierOBJECT,
                              param_grid = aGrid_of_parameters,
                              cv         = cv,
                              n_jobs     = n_JobsOnMultiCpuCores,
                              verbose    = 5
                              )

Sometimes GridSearchCV() really can consume a huge amount of CPU time and CPU resources, even after all of the above tips are applied.

So stay calm and do not panic, provided you are sure that the feature engineering, data sanity checks, and feature-domain preprocessing were done correctly.

 [GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.761619 - 62.7min
 [GridSearchCV] C=16777216.0, gamma=0.5 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.792793 - 64.4min
 [GridSearchCV] C=16777216.0, gamma=1.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.793103 -116.4min
 [GridSearchCV] C=16777216.0, gamma=1.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.794603 -205.4min
 [GridSearchCV] C=16777216.0, gamma=1.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.771772 -200.9min
 [GridSearchCV] C=16777216.0, gamma=2.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.713643 -446.0min
 [GridSearchCV] C=16777216.0, gamma=2.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.743628 -184.6min
 [GridSearchCV] C=16777216.0, gamma=2.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.761261 -281.2min
 [GridSearchCV] C=16777216.0, gamma=4.0 .........................................
 [GridSearchCV] ............... C=16777216.0, gamma=4.0, score=0.670165 -138.7min
 [GridSearchCV] C=16777216.0, gamma=4.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.760120 - 97.3min
 [GridSearchCV] C=16777216.0, gamma=4.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.732733 - 66.3min
 [GridSearchCV] C=16777216.0, gamma=8.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.755622 - 13.6min
 [GridSearchCV] C=16777216.0, gamma=8.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.772114 -  4.6min
 [GridSearchCV] C=16777216.0, gamma=8.0 .........................................
 [GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.717718 - 14.7min
 [GridSearchCV] C=16777216.0, gamma=16.0 ........................................
 [GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.763118 - 1.3min
 [GridSearchCV] C=16777216.0, gamma=16.0 ........................................
 [GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.746627 -  25.4s
 [GridSearchCV] C=16777216.0, gamma=16.0 ........................................
 [GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.738739 -  44.9s
 [Parallel(n_jobs=1)]: Done 2700 out of 2700 | elapsed: 5670.8min finished

As already mentioned regarding the "... regular svm.SVC().fit ": kindly notice that it uses the default [C, gamma] values, and is therefore not representative of your model's behavior on your problem domain.

Re: Update

Yes, indeed, regularizing / scaling the SVM inputs is a must for this AI / ML tool. scikit-learn provides good machinery for creating and reusing a Scaler object: fit it once for a-priori scaling (before the dataset goes into .fit()), then reuse it for ex-post, ad-hoc scaling whenever a new example has to be rescaled before being sent to the predictor, as in anSvmCLASSIFIER.predict( aScalerOBJECT.transform( aNewExampleX ) ).

(Yes, aNewExampleX may be a matrix, so you can ask for "vectorized" processing of several predictions at once.)
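The scaler-reuse pattern described above can be sketched as follows, using StandardScaler from the modern scikit-learn API (variable names and the synthetic data here are illustrative, not part of scikit-learn):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(42)
X_train = rng.randn(50, 4) * 100.0               # badly scaled training features
y_train = (X_train[:, 0] > 0).astype(int)

# a-priori scaling: fit the scaler ONCE on the training data
scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X_train), y_train)

# ex-post scaling: new examples must pass through the SAME fitted scaler
X_new = rng.randn(3, 4) * 100.0                  # a matrix -> "vectorized" predict
predictions = clf.predict(scaler.transform(X_new))
print(predictions.shape)                         # one label per row of X_new
```

Re-fitting the scaler on the new examples instead of reusing the training-time one would silently shift the feature space and invalidate the predictions.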

Performance relief: the O( M^2 . N^1 ) computational complexity

Contrary to the intuition that the "width" of the problem, measured as N, the number of SVM features in matrix X, should be charged for the total computation time, an SVM classifier with the rbf kernel is by design an O( M^2 . N^1 ) problem.

Thus, there is a quadratic dependence on the total number of observations (examples) moved into the training phase ( .fit() ) or cross-validation, and it can hardly be argued that a supervised classifier will gain any better predictive power by "reducing" the (only linear) "width" of the feature set, which itself carries the inputs into the constructed predictive power of the SVM classifier, right?
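A rough illustration of the O( M^2 . N^1 ) claim: the RBF kernel (Gram) matrix the solver works over has M x M entries, each costing O(N) to evaluate, so doubling the number of samples M quadruples the kernel matrix size while doubling N only doubles the per-entry cost. A small sketch:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

N = 26                                    # number of features (the "width")
for M in (100, 200, 400):                 # number of training samples
    X = np.random.randn(M, N)
    K = rbf_kernel(X, gamma=0.1)          # M x M Gram matrix
    print(M, K.shape, K.size)             # entry count grows as M^2
```

Going from M=100 to M=400 multiplies the Gram matrix size by 16, which is why the number of training examples, not the feature count, dominates SVM fit time.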

+7

Support Vector Machines are sensitive to scaling. Most likely, your individual SVC simply takes a long time to build a single model. GridSearch is basically a brute-force method that fits base models with various parameter combinations. So, if your GridSearchCV is taking a long time to run, it is most likely due to

  • a large number of parameter combinations (which is not the case here), or
  • your individual model taking a long time to fit.
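The two factors above combine into a back-of-the-envelope estimate: total grid-search cost is roughly (number of parameter candidates) x (CV folds) x (single fit time). A sketch using the question's own grid and the ~328 us single-fit timing from the %timeit above (ParameterGrid is from the modern scikit-learn API):

```python
from sklearn.model_selection import ParameterGrid

# the question's grid: 2 gamma x 4 C for rbf, plus 4 C for linear
grid = [{"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
        {"kernel": ["linear"], "C": [1, 10, 100, 1000]}]
n_candidates = len(ParameterGrid(grid))     # 8 + 4 = 12 candidates
n_folds = 27                                # LeaveOneOut on 27 samples
single_fit_seconds = 328e-6                 # ~328 us per fit, from %timeit

total = n_candidates * n_folds * single_fit_seconds
print(n_candidates, round(total, 3), "seconds")
```

At well under a second for the whole search, this estimate confirms that the hour-long runtime came from individual fits failing to converge on unscaled data, not from the size of the grid.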
+3

I just want to point out one thing that can speed up any GridSearchCV, if you are not already using it:

GridSearchCV has an n_jobs parameter that runs the search on several cores of your processor, which speeds up the process. For instance:

 GridSearchCV(clf, verbose=1, param_grid=tuned_parameters, n_jobs=-1) 

Specifying -1 will use all available CPU cores. If you have four cores with hyperthreading, this spins up 8 concurrent workers instead of the 1 you get without this option.

0

Source: https://habr.com/ru/post/1489564/

