spark-submit / spark-shell: difference between yarn-client and yarn-cluster mode

I am running Spark with YARN.

From the link: http://spark.apache.org/docs/latest/running-on-yarn.html

I found an explanation of the different YARN modes, that is, the --master option, with which Spark can run:

"There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN."

From this I understand only that the difference is where the driver runs, but I cannot tell which mode is faster. Moreover:

  • With spark-submit, the --master parameter can be either yarn-client or yarn-cluster
  • With spark-shell, the --master parameter can be yarn-client, but cluster mode is not supported

So I do not know how to choose: when to use spark-shell, when to use spark-submit, and especially when to use client mode and when to use cluster mode.

+5
2 answers

The Spark shell should be used for interactive queries. It has to be run in yarn-client mode so that the machine you are working on acts as the driver.

With spark-submit, you submit jobs to the cluster and the work is then carried out on the cluster. You would normally run in cluster mode so that YARN can assign the driver to a suitable node in the cluster with available resources.
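
As a rough sketch of what the two launch styles look like on the command line, the Python helper below only assembles the argument lists and does not run anything. `--master yarn` and `--deploy-mode` are real spark-submit options (newer Spark releases use them in place of the older `yarn-client` / `yarn-cluster` master values); the helper names and the file `my_app.py` are hypothetical and exist only for illustration.

```python
import shlex


def build_submit_command(app, deploy_mode):
    """Assemble a spark-submit argument list for YARN (hypothetical helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode, app]


def build_shell_command():
    """spark-shell is interactive, so the driver stays on the local machine."""
    return ["spark-shell", "--master", "yarn", "--deploy-mode", "client"]


if __name__ == "__main__":
    # my_app.py is a placeholder application name, not from the original post.
    for mode in ("client", "cluster"):
        print(shlex.join(build_submit_command("my_app.py", mode)))
    print(shlex.join(build_shell_command()))
```

Only the `--deploy-mode` value changes between the two spark-submit calls; everything about where the driver ends up follows from that one switch.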

Some operations (e.g. .collect()) send all the data to the driver node, so performance can differ significantly depending on whether the driver node is inside the cluster or on a machine outside it (for example, a user's laptop).
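
To get a feel for the size of that difference, here is a back-of-the-envelope sketch in Python. All the numbers (result size and link speeds) are assumptions chosen for illustration, and the estimate ignores latency, serialization cost and driver memory limits.

```python
def collect_transfer_seconds(result_bytes, link_bits_per_second):
    """Estimate the time to move a collected result to the driver over one link."""
    return result_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    result_bytes = 2 * 10**9       # assumed: 2 GB result returned by collect()
    in_cluster_link = 10 * 10**9   # assumed: 10 Gbit/s between cluster nodes
    laptop_link = 100 * 10**6      # assumed: 100 Mbit/s from the cluster to a laptop

    print(collect_transfer_seconds(result_bytes, in_cluster_link))  # 1.6
    print(collect_transfer_seconds(result_bytes, laptop_link))      # 160.0
```

Under these assumed figures the same collect takes roughly a hundred times longer when the driver sits outside the cluster.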

+6

For training purposes, client mode is good enough. In a production environment, you should ALWAYS use cluster mode.

Let me explain with an example. Imagine a scenario in which you want to run several applications. Let's say you have a 5-node cluster with nodes A, B, C, D, E.

The workload is distributed across all 5 worker nodes, and one of them is also used to submit jobs (say 'A'). Now, every time you launch an application in client mode, the driver process always runs on 'A'.

This may work for a few jobs, but as the number of jobs grows, 'A' will run short of resources such as CPU and memory.

Imagine the impact on a very large cluster that runs many such jobs.

But if you choose cluster mode, the driver will not run on 'A' every time; instead, the drivers will be spread across all 5 nodes. Resources in this case are used more evenly.
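
The effect can be sketched with a toy Python model of the A to E example. Round-robin placement here is only a simplifying assumption standing in for YARN's scheduler, which in reality picks a node based on available resources; the function names are hypothetical.

```python
from collections import Counter
from itertools import cycle

NODES = ["A", "B", "C", "D", "E"]


def place_drivers_client_mode(num_apps, submit_node="A"):
    """Client mode: every driver runs on the machine that launched the job."""
    return Counter({submit_node: num_apps})


def place_drivers_cluster_mode(num_apps, nodes=NODES):
    """Cluster mode: YARN picks the node (round-robin is an assumption here)."""
    placement = Counter()
    for _, node in zip(range(num_apps), cycle(nodes)):
        placement[node] += 1
    return placement


if __name__ == "__main__":
    print(dict(place_drivers_client_mode(10)))   # {'A': 10}
    print(dict(place_drivers_cluster_mode(10)))  # two drivers on each of A to E
```

With ten applications, client mode stacks all ten drivers on 'A', while the cluster-mode model ends up with two drivers per node.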

Hope this helps you decide which mode to choose.

+5

Source: https://habr.com/ru/post/1234111/

