I am currently experimenting with an ML task that involves supervised training of a classification model. So far I have ~5M training examples and ~5M cross-validation examples. Each example currently has 46 features, but in the near future I may want to add about 10 more, so any solution should leave room for growth.
My question is: which tool should I use for this? I would like to use random forests or SVM, but I am afraid the latter may be too slow at this scale. I looked at Mahout, but turned away because it seems to require a fair amount of configuration plus running command-line scripts. I would much rather call a (well-documented!) library directly from code, or define my model through a graphical interface.
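To illustrate the kind of direct library call I have in mind, here is a minimal sketch. It assumes Python and scikit-learn (which runs on Windows and can use multiple cores); this is just one possible option, not something I have settled on, and the arrays are placeholders for my real data.

```python
# Minimal sketch: training a random forest by calling a library directly.
# Assumption: Python + scikit-learn; X_train, y_train, X_val are placeholders
# standing in for the real ~5M x 46 training and cross-validation sets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data with the current feature count (46 columns).
X_train = np.random.rand(1000, 46)
y_train = np.random.randint(0, 2, size=1000)
X_val = np.random.rand(1000, 46)

# n_jobs=-1 uses all available cores, which matters on a large EC2 instance.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X_train, y_train)

predictions = clf.predict(X_val)
```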
I should also point out that I'm looking for something that runs on Windows (without things like Cygwin), and solutions that play well with .NET would be much appreciated.
When the time comes, the code will be run on a Cluster Compute Eight Extra Large instance on Amazon EC2, so anything that makes heavy use of RAM and multiple cores is welcome.
Last but not least, my data set is dense: there are no missing values, and every column has a value in every feature vector.