Understanding Spark: Cluster Manager, Master nodes, and Driver

After reading this question, I would like to ask some additional questions:

  • Is the Cluster Manager a long-running service? On which node is it running?
  • Is it possible that the Master and the Driver node will be the same machine? I presume there should be a rule somewhere stating that these two nodes should be different?
  • In case of a Driver node failure, who is responsible for re-launching the application? And what will happen exactly? i.e., how will the Master node, Cluster Manager, and Worker nodes be involved (if they are), and in what order?
  • Similar to the previous question: in case of a Master node failure, what will happen exactly and who is responsible for recovering from the failure?
2 answers

1. Is the Cluster Manager a long-running service? On which node does it run?

In Spark Standalone mode, the Cluster Manager is the Master process. You can start it anywhere by running ./sbin/start-master.sh; on YARN, the Cluster Manager is the Resource Manager.
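A minimal sketch of bringing up a standalone cluster (host name and port are placeholders; the worker script is named start-worker.sh in recent Spark releases and start-slave.sh in older ones):

 # On the machine chosen as the standalone Master:
 ./sbin/start-master.sh
 # the Master logs a URL such as spark://master-host:7077

 # On every machine that should offer CPU/RAM as a Worker,
 # register it against that Master URL:
 ./sbin/start-worker.sh spark://master-host:7077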

2. Is it possible that the Master and Driver nodes will be the same machine? I suppose there must be a rule somewhere stating that the two nodes must be different?

There is one Master per cluster and one Driver per application. For standalone and YARN clusters, Spark currently supports two deploy modes.

  1. In client mode, the driver is launched in the same process as the client that submits the application.
  2. In cluster mode, however, the driver is launched on one of the Worker nodes (standalone) or inside the YARN Application Master (YARN), and the client process exits as soon as it has fulfilled its responsibility of submitting the application, without waiting for the application to finish.

If an application is submitted with --deploy-mode client from the Master node, both the Master and the Driver will be on the same node. See "Spark Application Deployment Through YARN".
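For illustration, a hedged sketch of both submit variants (the master URL, application jar and class name below are placeholders):

 # client mode: the driver runs inside the submitting process
 ./bin/spark-submit --master spark://master-host:7077 \
     --deploy-mode client \
     --class com.example.MyApp myapp.jar

 # cluster mode: the driver runs on a Worker (standalone) or inside the
 # YARN Application Master (with --master yarn); the submitting client
 # exits right after submission
 ./bin/spark-submit --master spark://master-host:7077 \
     --deploy-mode cluster \
     --class com.example.MyApp myapp.jar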

3. In case of a Driver node failure, who is responsible for re-launching the application? And what will happen exactly? i.e., how will the Master node, the Cluster Manager and the Worker nodes be involved (if they are), and in what order?

If the driver fails, all executor tasks will be killed for that submitted/running Spark application.

4. In case of a Master node failure, what will happen exactly and who is responsible for recovering from the failure?

Master node failures are handled in two ways.

  1. Standby Masters with ZooKeeper:

    Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected "leader" and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master's state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling of new applications; applications that were already running during the failover are unaffected. Check the Spark documentation for the configurations (a configuration sketch follows after this list).

  2. Single-node recovery with the local file system:

    ZooKeeper is the best way to go for production-level high availability, but if you just want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. Check the Spark documentation for the configuration and more details (the same sketch below also covers this variant).
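Both recovery modes are switched on through spark.deploy.recoveryMode, typically passed to the Master (and Worker) daemons via SPARK_DAEMON_JAVA_OPTS; a hedged sketch, with placeholder ZooKeeper hosts and recovery directory:

 # conf/spark-env.sh -- standby Masters with ZooKeeper
 SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
   -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
   -Dspark.deploy.zookeeper.dir=/spark"

 # conf/spark-env.sh -- single-node recovery with the local file system
 SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
   -Dspark.deploy.recoveryDirectory=/var/lib/spark/recovery"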


Is the Cluster Manager a long-running service? On which node is it running?

The Cluster Manager is just a manager of resources, i.e. the CPUs and RAM that SchedulerBackends use to launch tasks. A cluster manager does nothing more for Apache Spark than offering resources; once the Spark executors are running, they communicate directly with the driver to run tasks.

You can start a standalone Master server by executing:

 ./sbin/start-master.sh 

It can be launched anywhere.

To run an application on the Spark cluster:

 ./bin/spark-shell --master spark://IP:PORT 

Is it possible that the Master and the Driver node will be the same machine? I presume there should be a rule stating that these two nodes should be different?

In standalone mode, when you start your machines, certain JVMs start: your Spark Master starts up, and on each machine a Worker JVM starts and registers with the Spark Master. Both are the resource managers. When you launch or submit your application in cluster mode, the driver starts up wherever you ssh in to launch that application. The driver JVM contacts the Spark Master for executors (Ex), and in standalone mode the Worker starts the executors. So the Spark Master is per cluster, and the driver JVM is per application.

In case of a Driver node failure, who is responsible for re-launching the application? And what will happen exactly? i.e., how will the Master node, the Cluster Manager and the Worker nodes be involved (if they are), and in what order?

If an executor (Ex) JVM crashes, the Worker JVM will restart the executor, and when a Worker JVM crashes, the Spark Master will restart it. With a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code; the Spark Master will then relaunch the driver JVM.
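A hedged spark-submit sketch of that --supervise flag (standalone cluster deploy mode; the master URL, jar and class name are placeholders):

 ./bin/spark-submit --master spark://master-host:7077 \
     --deploy-mode cluster \
     --supervise \
     --class com.example.MyApp myapp.jar
 # if the driver exits with a non-zero code, the Master relaunches it on a Worker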

Similar to the previous question: in case of a Master node failure, what will happen exactly and who is responsible for recovering from the failure?

A Master failure means the executors will not be able to communicate with it, so they will stop working. A Master failure also means the driver cannot contact it for job status, so your application will fail. The loss of the Master will be acknowledged by the running applications, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:

1. The application will not be able to finish in a graceful way.

2. If the Spark Master is down, the Worker will try to reregisterWithMaster. If this fails multiple times, the Workers will simply give up.

reregisterWithMaster() re-registers with the active Master this Worker has been communicating with. If there is none, it means this Worker is still bootstrapping and has not yet established a connection with a Master, in which case it should re-register with all Masters. It is important to re-register only with the active Master during failures; if the Worker unconditionally attempts to re-register with all Masters, a race condition may arise, described in detail in SPARK-4592.

At this moment, long-running applications will not be able to continue processing, but this should not result in an immediate failure. Instead, the application will wait for the Master to come back online (file system recovery) or for contact from a new leader (ZooKeeper mode), and if that happens, it will continue processing.
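When standby Masters are configured through ZooKeeper (see the first answer), applications and Workers can be pointed at all Masters at once so that the current leader is found automatically; a sketch with placeholder hosts:

 # list every Master; the application registers with whichever one is the current leader
 ./bin/spark-shell --master spark://master1:7077,master2:7077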



