Spark tasks are not well distributed

I am getting started with Spark, and it seems that the tasks are not distributed very evenly (see attached screenshot). Is there a way to distribute tasks more evenly? Thanks!

[screenshot of the Spark UI attached to the original question]

4 answers

I think the tasks are distributed evenly across the different workers, because each one has a different port number in the Address column.


It is quite difficult to diagnose anything from your screenshot alone. However, there are two things you can consider:

  • The Spark UI (as of 1.3.1; I have not tried 1.4.0) shows only the aggregated statistics for completed tasks. If you took this screenshot while your application was still running, it is possible that some tasks were still executing and simply did not appear in the statistics yet!

  • At this stage, Spark cannot run more tasks in parallel than there are partitions in your data. It is hard to say more without additional code, but you can inspect the partition count (for example via rdd.partitions.size) and use rdd.repartition(sparkContext.getConf.getInt("spark.executor.instances", defaultValueInt)) to generate more partitions before processing, thereby smoothing the load across the executors; see the sketch after this list.
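For illustration, here is a minimal sketch of that suggestion, assuming an existing SparkContext named sc and a hypothetical input file; the fallback value of 4 is an arbitrary default, not something from the original question:

    // Read the configured executor count; fall back to 4 if it is not set.
    val executorCount = sc.getConf.getInt("spark.executor.instances", 4)

    val lines = sc.textFile("input.txt")  // hypothetical input path
    println(s"Partitions before: ${lines.partitions.length}")

    // Repartition so the work can be spread across all executors.
    val balanced = lines.repartition(executorCount)
    println(s"Partitions after: ${balanced.partitions.length}")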


If you want an even distribution, you can use Spark's partitioning functionality when loading a file into an RDD:

    val ratings = sc.textFile(fileName, numPartitions)

For example, if you have 10 nodes with 2 cores each, you can use a partition value of 20, and so on.
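As a minimal sketch of that example, assuming a cluster of 10 nodes with 2 cores each and a hypothetical ratings file:

    // Ask for at least one partition per core across the cluster.
    // Note that the second argument to textFile is a minimum; Spark
    // may create more partitions depending on the input splits.
    val numPartitions = 10 * 2
    val ratings = sc.textFile("ratings.txt", numPartitions)  // hypothetical path
    println(s"Loaded with ${ratings.partitions.length} partitions")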


Looking closely at the published image, I can make out two main facts:

  • The number of tasks per executor is distributed fairly evenly, differing by at most about 20 tasks.
  • The task time per executor differs significantly, from 3.0 minutes (~80 tasks) to 17.0 minutes (~60 tasks).

This makes me wonder about the nature of your application. Are all tasks equal, or do some of them need more time than others? If the tasks are not uniform, your problem needs a closer look. Imagine the following scenario:

  • Number of tasks: 20, each taking 10 seconds, except the last one:

       Task 01: 10 seconds
       Task 02: 10 seconds
       Task 03: 10 seconds
       ...
       Task 20: 120 seconds

  • Number of executors: 4 (each with a single core)

If the tasks were distributed evenly, each executor would process 5 tasks in total. Given that task 20, which needs 120 seconds to complete, is assigned to one of those executors, the execution flow will be as follows:

  • At second 40, each executor will have completed its first 4 tasks, given that the 20th task is left for the end.
  • At second 50, every executor except one will have finished all of its tasks. The remaining executor will still be computing task 20, which takes 120 seconds to complete.

In the end, the user interface will display a result similar to yours, with the number of tasks evenly distributed but not the actual computation time:

    Executor 01 -> tasks completed: 5 -> time: 0:50 minutes
    Executor 02 -> tasks completed: 5 -> time: 0:50 minutes
    Executor 03 -> tasks completed: 5 -> time: 0:50 minutes
    Executor 04 -> tasks completed: 5 -> time: 2:40 minutes
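To make the arithmetic concrete, here is a small sketch (plain Scala, no Spark needed) that simulates this schedule: each task goes to whichever executor frees up first, and the single 120-second task dominates one executor's total. The numbers are taken from the scenario above:

    // 19 tasks of 10 seconds each, plus one 120-second straggler.
    val durations = Array.fill(19)(10) :+ 120
    val finishTimes = Array.fill(4)(0)   // accumulated seconds per executor
    val taskCounts  = Array.fill(4)(0)

    for (d <- durations) {
      // Greedy scheduling: hand the task to the earliest-free executor.
      val idx = finishTimes.indexOf(finishTimes.min)
      finishTimes(idx) += d
      taskCounts(idx)  += 1
    }

    for (i <- 0 until 4)
      println(s"Executor 0${i + 1}: ${taskCounts(i)} tasks, ${finishTimes(i)} s")
    // Prints 5 tasks for every executor, but 160 s (2:40) for one of them
    // and 50 s (0:50) for the other three.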

Although it may not be exactly the same, something similar could be happening in your situation.


Source: https://habr.com/ru/post/989260/

