Running gridsearch using python scikit-learn library on an Amazon EC2 cluster

Question


Sorry if this question is somewhat specific to the Python scikit-learn library.

I am trying to run a grid search to find the best parameters for scikit-learn's GradientBoostingRegressor. The problem is that I do not know where to start. I am used to R and RStudio, but I am trying to move to Python for data mining, and scikit-learn seems very promising.

Can someone share some simple setup code that I could use to run the computation on an Amazon EC2 cluster, or point to a useful example for this library, even one written for another machine learning algorithm?

Thanks.

+4

python scikit-learn amazon-ec2

ak3nat0n Oct 30 '12 at 18:08


2 answers

I completely agree with ogrisel: StarCluster is very convenient, as it lets you quickly set up an IPython cluster and supports spot instances, which is great because they are much cheaper than regular instances.

You can find code for this approach, which shows how to perform a distributed grid search for Gradient Boosting models on an IPython cluster.

It performs a grid search combined with cross-validation and stores the evaluated grid points in a MongoDB database.

The code automatically selects the best number of trees based on the average cross-validation score.
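To illustrate the idea of picking the number of trees from the averaged cross-validation score, here is a minimal sketch. It is not the code referred to above: it runs serially on one machine, leaves out MongoDB and the IPython cluster, and assumes a recent scikit-learn where `KFold` and `ParameterGrid` live in `sklearn.model_selection`. The helper names `evaluate_task` and `grid_search_n_trees` and the toy data are made up for illustration. Each (grid point, fold) pair is an independent task, which is what makes this kind of search easy to distribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, ParameterGrid


def evaluate_task(params, X, y, train_idx, test_idx, max_trees):
    """Fit one grid point on one CV fold; return the test MSE after each tree."""
    model = GradientBoostingRegressor(
        n_estimators=max_trees, random_state=0, **params
    )
    model.fit(X[train_idx], y[train_idx])
    # staged_predict yields predictions after 1, 2, ..., max_trees trees,
    # so one fit gives the score for every candidate number of trees.
    return np.array([
        mean_squared_error(y[test_idx], pred)
        for pred in model.staged_predict(X[test_idx])
    ])


def grid_search_n_trees(X, y, param_grid, max_trees=100, n_splits=3):
    """Hypothetical helper: serial grid search that also tunes n_estimators."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    folds = list(cv.split(X))
    results = []
    for params in ParameterGrid(param_grid):
        # On a cluster, each call below could be shipped to a different worker.
        curves = [
            evaluate_task(params, X, y, train_idx, test_idx, max_trees)
            for train_idx, test_idx in folds
        ]
        mean_curve = np.mean(curves, axis=0)  # average MSE over folds, per stage
        best_stage = int(np.argmin(mean_curve))
        results.append({
            "params": params,
            "n_estimators": best_stage + 1,
            "cv_mse": float(mean_curve[best_stage]),
        })
    best = min(results, key=lambda r: r["cv_mse"])
    return best, results


# Toy data, only to make the sketch runnable.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}
best, all_results = grid_search_n_trees(X, y, grid, max_trees=50)
print(best)
```

Fitting once with the maximum number of trees and scoring every stage is much cheaper than putting `n_estimators` into the grid itself, since boosting builds the trees one after another anyway.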

Happy tuning!

+5

Peter Prettenhofer Oct 31 '12 at 19:53


ogrisel · Accepted Answer · 2012-10-30T18:45:42+0000

As far as I know, GBRT is a fairly sequential algorithm (each tree is fit on the errors of the trees before it), so there is no trivial way to train a single model in parallel.

Random forests / ExtraTrees models are embarrassingly parallel, so they would be better candidates for training models on a cluster.
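As a small illustration of that point, the trees of a random forest are independent of each other, so scikit-learn can build them concurrently when `n_jobs` is set. This is a minimal single-machine sketch on made-up data, not a cluster setup; the value `n_jobs=2` is just an example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data, only to make the sketch runnable.
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Each tree is trained independently, so the work is split over n_jobs workers.
forest = RandomForestRegressor(n_estimators=50, n_jobs=2, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))  # number of fitted trees
```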

scikit-learn has built-in support for single-machine multiprocessing using joblib (check the docstring of models that take the n_jobs argument). At some point, we plan to implement a job submission infrastructure in joblib, so that we could, for example, use IPython.parallel as a backend for working on a cluster. However, nothing is ready for this right now.
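For the single-machine case, a grid search over GradientBoostingRegressor might look like the sketch below. Even though one GBRT model trains sequentially, the different grid points and folds are independent and can run in separate processes via `n_jobs`. This assumes a recent scikit-learn where `GridSearchCV` is in `sklearn.model_selection`; the parameter grid and the toy data are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Toy data, only to make the sketch runnable.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Example grid: 2 * 2 * 2 = 8 candidate settings.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
    n_jobs=2,  # fit candidates/folds in 2 worker processes; -1 uses all cores
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

On a single large EC2 instance this is often enough: pick a machine with many cores and set `n_jobs=-1`.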

If you are willing to spend some time on this, I would advise you to take a look at StarCluster and its IPython plugin.
