What factors affect the number of Spark jobs running at the same time

We recently built a Spark Job Server to which search queries are submitted. But we found that our Spark cluster of 20 nodes (8 cores / 128 GB of memory per node) can only sustain about 10 Spark jobs running simultaneously.

Can someone share some details on which factors really affect how many Spark jobs can run simultaneously? How should we configure things so that we make full use of the cluster?

+4
2 answers

There isn't much context in the question, but first of all: it seems that the Spark Job Server limits the number of concurrent jobs per context (unlike Spark itself, which puts a limit on the number of tasks, not jobs):

From application.conf

 # Number of jobs that can be run simultaneously per context
 # If not set, defaults to number of cores on machine where jobserver is running
 max-jobs-per-context = 8
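If that is the bottleneck, the limit can be raised, and extra contexts can be predefined, in the same file. The keys below follow the spark-jobserver configuration layout, but treat this as a sketch and check the documentation of the version you actually run; the context name and resource figures are illustrative only:

 # Raise the per-context job limit (the value here is just an example)
 spark.jobserver {
   max-jobs-per-context = 32
 }

 # Predefined contexts let independent groups of jobs run side by side
 spark.contexts {
   search-context {
     num-cpu-cores = 64        # cores reserved for this context
     memory-per-node = 16g     # executor memory per node in this context
   }
 }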

If this is not the problem (you set the limit higher, or use more than one context), then the total number of cores in the cluster (8 * 20 = 160) is the maximum number of simultaneous tasks. If each of your jobs creates 16 tasks, Spark will queue the next incoming job, waiting for CPUs to become available.
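For illustration, here is the arithmetic behind that ceiling, using only the numbers from the question plus the hypothetical 16 tasks per job mentioned above:

 // Plain Scala; nothing Spark-specific is needed for the estimate itself.
 val totalCores  = 20 * 8   // 20 nodes * 8 cores = 160 task slots
 val tasksPerJob = 16       // assumed fan-out of a single job
 // With the default FIFO scheduling inside a context, once 160 / 16 = 10 jobs
 // have all of their tasks running, the 11th job waits for free cores.
 println(s"Jobs that can run with every task in flight: ${totalCores / tasksPerJob}")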

Spark creates one task per partition of an RDD/DataFrame, and you can control that number with repartition or coalesce. Note that if an RDD is built from several other RDDs (e.g. via union), its number of partitions is the sum of the partitions of its parents.
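A minimal sketch of those APIs, as you would type it into spark-shell (the input path and partition counts are made up):

 // spark-shell already provides `sc` (a SparkContext)
 val rdd = sc.textFile("hdfs:///data/search-queries")  // hypothetical input
 println(rdd.getNumPartitions)                         // = tasks launched per stage

 val wider  = rdd.repartition(160)  // full shuffle: raises parallelism to 160 tasks
 val narrow = wider.coalesce(40)    // no shuffle: merges down to 40 partitions

 // union does not merge partitions: the result gets the sum of its parents' counts
 val combined = wider.union(narrow)
 println(combined.getNumPartitions) // 160 + 40 = 200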

+1

There are also a couple of other things that affect parallelism, for example:

  • The partitioning of your input data (when reading from files) caps the parallelism of the stage that reads it. For example, even with 20 nodes, if the input arrives as 10 non-splittable files you get 10 partitions and therefore at most 10 parallel tasks for that stage (whether a file can be split depends on the format and codec: gzip cannot be split, LZO only when indexed, etc.; see the sketch after this list).
  • Actions such as take() (and first()) start by scanning a single partition and only spread to more partitions if they still need rows to satisfy the take, so such a job may occupy far fewer cores than the data has partitions.
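A short sketch of both points above, again for spark-shell (paths and numbers are illustrative):

 // 1) A non-splittable file (e.g. gzip) becomes exactly one partition no matter
 //    how large it is, so 10 such files mean at most 10 parallel tasks in the
 //    stage that reads them.
 val logs = sc.textFile("hdfs:///logs/*.gz")   // hypothetical path
 println(logs.getNumPartitions)                // == number of files here

 //    Repartition after reading if later stages need more parallelism:
 val spread = logs.repartition(160)

 // 2) take() first scans a single partition and only scales out to more
 //    partitions if it still needs rows, so it may launch very few tasks.
 val firstRows = spread.take(100)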


0

Source: https://habr.com/ru/post/1626204/

