It depends on the nature of the machine learning problem you want to solve. I would recommend that you first subsample your dataset to something that fits in memory (for example, 100k samples with a few hundred non-zero features per sample, assuming a sparse representation).
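A minimal sketch of that subsampling step, assuming the data lives in a svmlight/libsvm file (the file name is hypothetical, not from the original answer):

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

# Load the full dataset in a sparse (CSR) representation.
X, y = load_svmlight_file("data.svmlight")  # hypothetical file name

# Draw a random subsample of 100,000 rows that comfortably fits in memory.
rng = np.random.RandomState(42)
idx = rng.choice(X.shape[0], size=100_000, replace=False)
X_small, y_small = X[idx], y[idx]
```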
Then try out some machine learning algorithms in scikit-learn that scale to a large number of samples:
- SGDClassifier or MultinomialNB if you want to do supervised classification (if you have categorical labels to predict in your dataset); see the sketch after this list
- SGDRegressor if you want to do supervised regression (if you have a continuous target variable to predict)
- MiniBatchKMeans if you want to do unsupervised clustering (though there is then no objective way to quantify the quality of the resulting clusters by default)
- ...
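A minimal sketch of the supervised-classification option from the list, using synthetic data (the dataset and all parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a large in-memory subsample.
X, y = make_classification(n_samples=100_000, n_features=200, random_state=0)

# SGDClassifier trains a linear model by stochastic gradient descent,
# which is why it scales well to a large number of samples.
clf = SGDClassifier(alpha=1e-4, random_state=0)
clf.fit(X, y)
print(clf.score(X[:1000], y[:1000]))  # quick sanity check on a slice
```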
Perform a grid search to find the optimal values of the model hyperparameters (for example, the regularizer alpha and the number of passes n_iter for SGDClassifier) and evaluate the performance with cross-validation.
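A hedged sketch of that grid search step with GridSearchCV; note that the n_iter parameter mentioned above was renamed max_iter in scikit-learn 0.19+, and the grid values and dataset here are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3],  # regularization strength
    "max_iter": [5, 10, 50],      # number of passes (n_iter in old versions)
}
# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)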
Once this works, try again with a dataset twice as large (still fitting in memory) and check whether that significantly improves predictive accuracy. If it does not, don't waste time parallelizing this over a cluster to run it on the full dataset, since that will not give any better results.
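A minimal sketch of that "does 2x more data help?" check; the sizes and dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200_000, n_features=100, random_state=0)

for n in (50_000, 100_000):  # the second size is 2x the first
    scores = cross_val_score(SGDClassifier(random_state=0), X[:n], y[:n], cv=3)
    print(f"n={n}: mean CV accuracy = {scores.mean():.4f}")

# If doubling the data barely moves the score, the model has likely
# plateaued and scaling out to the full dataset will not help.
```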
If it does, split the data into shards, distribute the shards to the nodes of your cluster, train an SGDClassifier or SGDRegressor model on each node independently (e.g. with picloud), collect the weights (coef_ and intercept_), and then average those weights to build the final linear model and evaluate it on a held-out fragment of your dataset.
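A hedged sketch of that averaging scheme, run on a single machine for illustration; in a real deployment each shard would be trained on a separate node (e.g. via picloud) and only coef_/intercept_ would be shipped back. The shard count, data, and shape-initialization trick are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100_000, n_features=100, random_state=0)

# Hold out a fragment of the data for the final evaluation.
X_train, y_train = X[:80_000], y[:80_000]
X_test, y_test = X[80_000:], y[80_000:]

# Split the training data into shards and fit one model per shard.
n_shards = 4
coefs, intercepts = [], []
for X_shard, y_shard in zip(np.array_split(X_train, n_shards),
                            np.array_split(y_train, n_shards)):
    clf = SGDClassifier(random_state=0).fit(X_shard, y_shard)
    coefs.append(clf.coef_)
    intercepts.append(clf.intercept_)

# Build the final linear model by averaging the collected weights.
# Fitting on a small slice first just initializes the attribute shapes.
final = SGDClassifier(random_state=0).fit(X_train[:100], y_train[:100])
final.coef_ = np.mean(coefs, axis=0)
final.intercept_ = np.mean(intercepts, axis=0)
print("held-out accuracy:", final.score(X_test, y_test))
```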
For more on error analysis, see how to plot learning curves:
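The link from the original answer is no longer available, but a minimal sketch of a learning curve with scikit-learn's learning_curve utility (dataset and sizes are illustrative assumptions) looks like this:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

# Train on increasing fractions of the data and cross-validate each time.
sizes, train_scores, val_scores = learning_curve(
    SGDClassifier(random_state=0), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("number of training samples")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

If the validation score is still rising as the training size grows, more data should help; if both curves have flattened, parallelizing over the full dataset is unlikely to pay off.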