How does Spark achieve parallelism within a single task on multi-core or hyper-threaded machines?

I have been reading and trying to understand how Spark uses its cores in Standalone mode. According to the Spark documentation, spark.task.cpus defaults to 1 and sets the "number of cores to allocate for each task".
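For reference, this is the kind of setting I mean; the master URL and the value 4 are only examples:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: one way to supply spark.task.cpus (values are examples, not my real config).
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
  .setAppName("task-cpus-question")
  .set("spark.task.cpus", "4")
val sc = new SparkContext(conf)
```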

Question 1: On a multi-core machine (for example, 4 cores with 8 hardware threads in total), when spark.task.cpus = 4, will Spark use 4 cores (one thread per core) or 2 cores with their hyper-threads?

What happens if I set spark.task.cpus = 16, which is more than the number of available hardware threads on this machine?

Question 2: How is this kind of hardware parallelism achieved? I tried to look into the code, but could not find anything that talks to the hardware or the JVM for core-level parallelism. For example, if a task is a filter function, how is a single filter task spread across multiple cores or threads?

Maybe I am missing something. Is this related to the Scala language?

1 answer

To answer your title question: Spark by itself does not give you parallelism gains within a single task. The main purpose of the spark.task.cpus parameter is to allow for tasks that are themselves multi-threaded. If you call an external multi-threaded routine inside each task, or you want to encapsulate the finest level of parallelism yourself at the task level, you may want to set spark.task.cpus to more than 1 (see the sketch after the list below).

  • Setting this parameter to more than 1 is not something you would do often, though.

    • The scheduler will not launch a task if the number of available cores is less than the number of cores required by that task, so if your executor has 8 cores and you set spark.task.cpus to 3, only 2 tasks will run concurrently and 2 cores will sit idle.
    • If your tasks do not consume the full capacity of the cores all the time, you may find that using spark.task.cpus=1 and tolerating some contention within the task still gives you better performance.
    • Overhead from things like GC or I/O should probably not be accounted for in the spark.task.cpus setting, because it is a much more static cost that does not scale linearly with your task count.
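A minimal sketch of that multi-threaded-task case, assuming a hypothetical expensiveTransform and a local thread pool sized to match spark.task.cpus. None of this parallelism comes from Spark; the threading inside the task has to be your own code, and spark.task.cpus only tells the scheduler how many cores to reserve for it:

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

import org.apache.spark.{SparkConf, SparkContext}

object MultiThreadedTaskSketch {
  def expensiveTransform(x: Int): Int = x * x   // placeholder for real CPU-heavy work

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("multithreaded-task-sketch")
      .set("spark.task.cpus", "3")              // tell the scheduler each task needs 3 cores
    val sc = new SparkContext(conf)

    val result = sc.parallelize(1 to 100000, numSlices = 8).mapPartitions { iter =>
      // The parallelism inside the task is ours, not Spark's: a local thread pool
      // whose size matches spark.task.cpus.
      val pool = Executors.newFixedThreadPool(3)
      implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
      val futures = iter.grouped(1000).map(chunk => Future(chunk.map(expensiveTransform))).toSeq
      val out = Await.result(Future.sequence(futures), Duration.Inf).flatten
      pool.shutdown()
      out.iterator
    }

    println(result.count())
    sc.stop()
  }
}
```

With 8-core executors and spark.task.cpus = 3, at most 2 such tasks run per executor at a time, matching the scheduling arithmetic in the list above.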

Question 1: On a multi-core machine (for example, 4 cores with 8 hardware threads in total), when spark.task.cpus = 4, does Spark use 4 cores (one thread per core) or 2 cores with their hyper-threads?

The JVM will almost always rely on the OS to provide it with information about, and mechanisms for working with, the processors, and AFAIK Spark does nothing special here. If Runtime.getRuntime().availableProcessors() or ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors() returns 4 for your dual-core, hyper-threaded Intel processor, Spark will see 4 cores as well.
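For instance, a throwaway check like the following (plain JVM calls, nothing Spark-specific) prints the logical core count that both the JVM and Spark will work with; on a 2-core CPU with hyper-threading the OS typically reports 4:

```scala
import java.lang.management.ManagementFactory

object CpuCount {
  def main(args: Array[String]): Unit = {
    // Both calls report logical processors as exposed by the OS, not physical cores.
    println(s"Runtime:  ${Runtime.getRuntime.availableProcessors()}")
    println(s"MXBean:   ${ManagementFactory.getOperatingSystemMXBean.getAvailableProcessors}")
  }
}
```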

Question 2: How is this kind of hardware parallelism achieved? I tried to look into the code, but could not find anything that talks to the hardware or the JVM for core-level parallelism. For example, if a task is a filter function, how is a single filter task spread across multiple cores or threads?

As mentioned above, Spark will not automatically parallelize a task according to the spark.task.cpus parameter. Spark is essentially a data-parallel engine, and its parallelism is achieved mainly by representing your data as an RDD split into partitions, with one task per partition.
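A rough illustration with made-up sizes: the filter below is not split across cores within one task; instead Spark runs one copy of it per partition, and it is those per-partition tasks that occupy multiple cores:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterParallelismSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-parallelism"))

    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)  // 8 partitions -> 8 filter tasks
    val evens   = numbers.filter(_ % 2 == 0)                   // each task filters its own partition

    println(s"partitions = ${evens.getNumPartitions}, count = ${evens.count()}")
    sc.stop()
  }
}
```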

