How are tasks distributed in a Spark cluster?

My input consists of a data set and several ML algorithms (with their parameter settings), implemented with scikit-learn. I have made several attempts to run this as efficiently as possible, but at this point I do not have the infrastructure to properly evaluate my results, and I lack background in this area, so I would appreciate some clarification.

Basically, I want to understand how the tasks are distributed so that all available resources are used, and what Spark actually does implicitly versus what I have to do myself.

This is my scenario: [diagram omitted]

I need to train many different decision tree models (as many as there are combinations of all the possible parameters), many different Random Forest models, and so on.
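For concreteness, here is a minimal sketch of what I mean by "algorithms" (the estimators and parameter grids below are just placeholders, not my actual settings):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Each entry pairs an estimator with the parameter grid to be searched for it.
algorithms = [
    (DecisionTreeClassifier(), {"max_depth": [3, 5, 10, None],
                                "min_samples_leaf": [1, 5, 10]}),
    (RandomForestClassifier(), {"n_estimators": [10, 50, 100],
                                "max_depth": [3, 5, None]}),
]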

My idea is to run one experiment per ML algorithm and let Spark distribute them:

results = spark.parallelize(algorithms).map(lambda algorithm: run_experiment(dataframe, algorithm)).collect()

Inside run_experiment I build a GridSearchCV for the given ML algorithm and fit it, with n_jobs=-1 so that the grid search uses all the cores of whatever machine it runs on (maximum local parallelism).
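A minimal sketch of what I mean by run_experiment, assuming each element of algorithms is an (estimator, param_grid) pair and the data is already prepared as feature/label arrays:

from sklearn.model_selection import GridSearchCV

def run_experiment(dataframe, algorithm):
    estimator, param_grid = algorithm
    X, y = dataframe                     # features and labels prepared up front
    search = GridSearchCV(estimator, param_grid, n_jobs=-1, cv=5)
    search.fit(X, y)                     # the grid search runs locally, using all cores
    return search.best_params_, search.best_score_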

My question is: if Spark distributes these tasks across the cluster, how exactly are they assigned to the workers, and what happens to the data set?


For example, when one of the tasks trains a Random Forest, does that training run entirely on a single node? And if so, does every node receive its own copy of the whole data set, or is the data somehow partitioned between them?

Also, does it even make sense to parallelize this myself with parallelize, or should I just use a plain for loop and rely on GridSearchCV, or should I use Databricks' spark-sklearn package, which integrates Spark with scikit-learn? In that case, as I understand it, the work would be distributed like this:

[diagram omitted]

Finally, if I implement the ML part with Spark MLlib instead of scikit-learn, will both the data and the computation be distributed?

I hope my questions are clear. Thanks in advance for any help.


(I also posted this question on the CS StackExchange.)


spark.parallelize(algorithms).map(...)

ref, " , , ". , . .

In other words, each task receives one algorithm plus a copy of the dataframe (which is captured in the lambda's closure) and runs run_experiment on it locally, on whichever executor the task lands on.

Keep in mind that Spark only parallelizes across the experiments: each individual training run (including GridSearchCV with n_jobs=-1) still happens on a single machine, so every worker needs enough memory to hold the full data set.
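As an illustration (not something you have to do), you can broadcast the data once so each executor keeps a single read-only copy instead of re-shipping it inside every task's closure:

# sc: your SparkContext (called `spark` in the question's snippet)
data_bc = sc.broadcast((X, y))           # broadcast the driver-local data once

def run_experiment_bc(algorithm):
    return run_experiment(data_bc.value, algorithm)   # executors read their local copy

results = (sc.parallelize(algorithms, numSlices=len(algorithms))
             .map(run_experiment_bc)
             .collect())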


"Does it make sense to parallelize this with parallelize, or should I use a plain for loop?"

Yes, it does. Each element of the (small) collection of algorithms becomes an element of an RDD, so the experiments are processed in parallel across the workers.

"...or should I use Databricks' spark-sklearn package, which integrates Spark and scikit-learn?"

Its documentation answers this, using Random Forests as the example: the package parallelizes scikit-learn's grid-search cross-validation across the Spark cluster, and each node trains the scikit-learn models locally against its own full copy of the data.

So with spark-sklearn, too, only the work over the parameter grid is distributed; the data set itself is not.
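If you go that route, a sketch of its usage, assuming its drop-in GridSearchCV wrapper that takes the SparkContext as its first argument:

from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV   # pip install spark-sklearn

param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(sc, RandomForestClassifier(), param_grid, cv=3)
search.fit(X, y)        # X, y live on the driver; each worker fits on a full copy
print(search.best_params_)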


"If I implement the ML part with Spark MLlib instead of scikit-learn, will the data and the computation both be distributed?"

Yes, both are distributed. MLlib's algorithms are implemented on top of Spark's own distributed data structures, so the training data lives in a distributed DataFrame/RDD and the work of fitting each model is spread across the cluster.
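A sketch with pyspark.ml, where both the DataFrame and the model fitting are distributed; the column names and grid values here are illustrative assumptions:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [10, 50, 100])
        .addGrid(rf.maxDepth, [3, 5, 10])
        .build())
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                    numFolds=3)
model = cv.fit(train_df)   # train_df is a distributed Spark DataFrame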


Hope this helps.

