Amazon EC2 vs PiCloud

We are students trying to process a dataset of about 140 million records and run several machine learning algorithms on it. We are new to cloud solutions and to Mahout. The data currently lives in a PostgreSQL database, but the current setup does not scale, and read/write operations remain very slow despite numerous performance tweaks. So we are planning to move to cloud services.

We have examined several possible alternatives:

  • Amazon cloud services (with the Mahout implementation)
  • PiCloud with scikit-learn (we were thinking of using the HDF5 format with NumPy)
  • Any other option you would recommend, if any

Here are our questions:

  • Which option will give us the best results (turnaround time) and be cost-effective? Please mention any other alternatives as well.
  • If we set up Amazon services, what format should our data be in? Will costs take off if we use DynamoDB?

thanks

+6
5 answers

PiCloud is built on top of AWS, so either way you'll be using Amazon at the end of the day. The question is just how much infrastructure you have to write yourself to get everything wired together. PiCloud gives you some free usage to put it through its paces, so you can try it out first. I haven't used it myself, but it clearly aims to make deploying machine learning applications easy.

It sounds like your goal is to get results rather than to build cloud infrastructure, so I would either look into one of Amazon's other services besides straight EC2, or into software like PiCloud or Heroku or another service that can take care of the bootstrapping for you.

+5

It depends on the nature of the machine learning problem you want to solve. I would recommend that you first subsample your dataset down to something that fits in memory (for example, 100k samples with a few hundred non-zero features per sample, assuming a sparse representation).

Then try some of the machine learning algorithms in scikit-learn that scale to a large number of samples:

  • SGDClassifier or MultinomialNB if you want to perform supervised classification (if you have categorical labels to predict in your dataset)
  • SGDRegressor if you want to perform supervised regression (if you have a continuous target variable to predict)
  • MiniBatchKMeans if you want to perform unsupervised clustering (but then there is no objective way to quantify the quality of the resulting clusters by default)
  • ...

Perform a grid search to find the optimal values of the model hyperparameters (for example, the regularizer alpha and the number of passes n_iter for SGDClassifier) and evaluate performance with cross-validation.
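
For example, here is a minimal sketch of such a grid search, assuming a recent scikit-learn where the number-of-passes parameter is called max_iter (older releases called it n_iter), with synthetic data standing in for your real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=10_000, n_features=200, random_state=0)

param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3],  # regularization strength
    "max_iter": [5, 20, 50],      # number of passes over the data
}
# tol=None makes SGDClassifier run exactly max_iter passes (no early stopping).
search = GridSearchCV(SGDClassifier(tol=None, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```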

Once that works, try again with a dataset twice as large (still fitting in memory) and check whether this significantly improves prediction accuracy. If it does not, don't waste your time parallelizing this on a cluster to run it on the full dataset, since that won't give any better results.

If it does, what you could do is split the data into chunks, distribute the chunks across nodes, train an SGDClassifier or SGDRegressor model on each node independently (e.g. with PiCloud), collect the weights (coef_ and intercept_), then average those weights to build the final linear model, and evaluate it on a held-out chunk of your dataset.
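
Here is a hedged sketch of that averaging scheme, run locally with synthetic data standing in for the per-node chunks (on PiCloud each fit would run on a separate node):

```python
import copy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=30_000, n_features=100, random_state=0)
X_train, y_train = X[:24_000], y[:24_000]
X_held, y_held = X[24_000:], y[24_000:]  # held-out chunk for evaluation

# Stand-in for per-node shards of the training data.
chunks = [(X_train[i::3], y_train[i::3]) for i in range(3)]
models = [SGDClassifier(alpha=1e-4, random_state=i).fit(Xc, yc)
          for i, (Xc, yc) in enumerate(chunks)]

# Average the per-node weights into one final linear model.
avg = copy.deepcopy(models[0])  # copies classes_ and other fitted metadata
avg.coef_ = np.mean([m.coef_ for m in models], axis=0)
avg.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
print("held-out accuracy:", avg.score(X_held, y_held))
```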

For more on error analysis, have a look at how to plot learning curves.
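
For instance, scikit-learn ships a learning_curve utility; the sketch below (again with an illustrative synthetic dataset) plots validation accuracy against training set size, so you can see whether more data would still help:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20_000, n_features=100, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    SGDClassifier(alpha=1e-4, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)

plt.plot(sizes, train_scores.mean(axis=1), marker="o", label="train")
plt.plot(sizes, val_scores.mean(axis=1), marker="o", label="validation")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()  # a flat validation curve means more data won't help much
```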

+7

AWS has a program to support educational users, so you might want to look into it.

0

You should take a look at numba if you are looking for some NumPy speedups: https://github.com/numba/numba

It does not solve the cloud-scaling problem, but it can reduce computation time.
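
For example, here is a minimal sketch assuming numba is installed; the function below is illustrative, not from the original post:

```python
import numpy as np
from numba import njit

@njit
def pairwise_sq_dist_sum(x):
    # Plain Python loops that numba compiles to fast machine code.
    total = 0.0
    n = x.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            d = x[i] - x[j]
            total += d * d
    return total

x = np.random.rand(2_000)
print(pairwise_sq_dist_sum(x))  # first call compiles; later calls are fast
```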

0

I just made a comparison between PiCloud and Amazon EC2 that may be useful.

-1
