How are tasks distributed in a Spark cluster?

My input consists of a data set and several ML algorithms (with their parameter settings), implemented with scikit-learn. I have made several attempts to run this as efficiently as possible, but at this point I do not have the infrastructure to properly evaluate my results, and I lack background in this area, so I would appreciate some clarification.

Basically, I want to understand how the tasks are distributed so that all available resources are used, and what Spark actually does implicitly versus what I have to do myself.

This is my scenario: [diagram omitted]

I need to train many different decision tree models (as many as there are combinations of all the possible parameters), many different Random Forest models, and so on.
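For concreteness, here is a minimal sketch of what I mean by "algorithms" (the estimators and parameter grids below are just placeholders, not my actual settings):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Each entry pairs an estimator with the parameter grid to be searched for it.
algorithms = [
    (DecisionTreeClassifier(), {"max_depth": [3, 5, 10, None],
                                "min_samples_leaf": [1, 5, 10]}),
    (RandomForestClassifier(), {"n_estimators": [10, 50, 100],
                                "max_depth": [3, 5, None]}),
]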

My idea is to run one experiment per ML algorithm and let Spark distribute them:

results = spark.parallelize(algorithms).map(lambda algorithm: run_experiment(dataframe, algorithm)).collect()

Inside run_experiment I build a GridSearchCV for the given ML algorithm and fit it, with n_jobs=-1 so that the grid search uses all the cores of whatever machine it runs on (maximum local parallelism).
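A minimal sketch of what I mean by run_experiment, assuming each element of algorithms is an (estimator, param_grid) pair and the data is already prepared as feature/label arrays:

from sklearn.model_selection import GridSearchCV

def run_experiment(dataframe, algorithm):
    estimator, param_grid = algorithm
    X, y = dataframe                     # features and labels prepared up front
    search = GridSearchCV(estimator, param_grid, n_jobs=-1, cv=5)
    search.fit(X, y)                     # the grid search runs locally, using all cores
    return search.best_params_, search.best_score_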

My question is: if Spark distributes these tasks across the cluster, how exactly are they assigned to the workers, and what happens to the data set?


For example, when one of the tasks trains a Random Forest, does that training run entirely on a single node? And if so, does every node receive its own copy of the whole data set, or is the data somehow partitioned between them?

Also, does it even make sense to parallelize this myself with parallelize, or should I just use a plain for loop and rely on GridSearchCV, or should I use Databricks' spark-sklearn package, which integrates Spark with scikit-learn? In that case, as I understand it, the work would be distributed like this:

[diagram omitted]

Finally, if I implement the ML part with Spark MLlib instead of scikit-learn, will both the data and the computation be distributed?

I hope my questions are clear. Thanks in advance for any help.


(I also posted this question on the CS StackExchange.)


spark.parallelize(algorithms).map(...)

ref, " , , ". , . .

In other words, each task receives one algorithm plus a copy of the dataframe (which is captured in the lambda's closure) and runs run_experiment on it locally, on whichever executor the task lands on.

Keep in mind that Spark only parallelizes across the experiments: each individual training run (including GridSearchCV with n_jobs=-1) still happens on a single machine, so every worker needs enough memory to hold the full data set.
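As an illustration (not something you have to do), you can broadcast the data once so each executor keeps a single read-only copy instead of re-shipping it inside every task's closure:

# sc: your SparkContext (called `spark` in the question's snippet)
data_bc = sc.broadcast((X, y))           # broadcast the driver-local data once

def run_experiment_bc(algorithm):
    return run_experiment(data_bc.value, algorithm)   # executors read their local copy

results = (sc.parallelize(algorithms, numSlices=len(algorithms))
             .map(run_experiment_bc)
             .collect())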


"Does it make sense to parallelize this with parallelize, or should I use a plain for loop?"

Yes, it does. Each element of the (small) collection of algorithms becomes an element of an RDD, so the experiments are processed in parallel across the workers.

"...or should I use Databricks' spark-sklearn package, which integrates Spark and scikit-learn?"

Its documentation answers this, using Random Forests as the example: the package parallelizes scikit-learn's grid-search cross-validation across the Spark cluster, and each node trains the scikit-learn models locally against its own full copy of the data.

So with spark-sklearn, too, only the work over the parameter grid is distributed; the data set itself is not.
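If you go that route, a sketch of its usage, assuming its drop-in GridSearchCV wrapper that takes the SparkContext as its first argument:

from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV   # pip install spark-sklearn

param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(sc, RandomForestClassifier(), param_grid, cv=3)
search.fit(X, y)        # X, y live on the driver; each worker fits on a full copy
print(search.best_params_)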


"If I implement the ML part with Spark MLlib instead of scikit-learn, will the data and the computation both be distributed?"

Yes, both are distributed. MLlib's algorithms are implemented on top of Spark's own distributed data structures, so the training data lives in a distributed DataFrame/RDD and the work of fitting each model is spread across the cluster.
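A sketch with pyspark.ml, where both the DataFrame and the model fitting are distributed; the column names and grid values here are illustrative assumptions:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [10, 50, 100])
        .addGrid(rf.maxDepth, [3, 5, 10])
        .build())
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                    numFolds=3)
model = cv.fit(train_df)   # train_df is a distributed Spark DataFrame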


Hope this helps.

