I am currently experimenting with an ML task that involves supervised training of a classification model. So far I have ~5M training examples and ~5M cross-validation examples. Each example currently has 46 features, but in the near future I may want to add about 10 more, so any solution should leave room for growth.
My question is: which tool should I use for this? I would like to use random forests or SVM, but I am afraid the latter may be too slow at this scale. I looked at Mahout, but turned away because it seems to require a fair amount of configuration plus running command-line scripts. I would much rather call a (well-documented!) library directly from code, or define my model through a graphical interface.
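To illustrate the kind of direct library call I have in mind, here is a minimal sketch. It assumes Python and scikit-learn (which runs on Windows and can use multiple cores); this is just one possible option, not something I have settled on, and the arrays are placeholders for my real data.

```python
# Minimal sketch: training a random forest by calling a library directly.
# Assumption: Python + scikit-learn; X_train, y_train, X_val are placeholders
# standing in for the real ~5M x 46 training and cross-validation sets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data with the current feature count (46 columns).
X_train = np.random.rand(1000, 46)
y_train = np.random.randint(0, 2, size=1000)
X_val = np.random.rand(1000, 46)

# n_jobs=-1 uses all available cores, which matters on a large EC2 instance.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X_train, y_train)

predictions = clf.predict(X_val)
```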
I should also point out that I'm looking for something that runs on Windows (without things like Cygwin), and solutions that play well with .NET would be much appreciated.
When the time comes, the code will be run on a Cluster Compute Eight Extra Large instance on Amazon EC2, so anything that makes heavy use of RAM and multiple cores is welcome.
Last but not least, my data set is dense: there are no missing values, and every column has a value in every feature vector.