Is the Cluster Manager a long-running service, and on which node is it running?
The Cluster Manager is just a manager of resources, i.e. of the CPU and RAM that SchedulerBackends use to run tasks. The Cluster Manager does nothing more for Apache Spark than offer resources; once the Spark executors are launched, they communicate directly with the driver to run tasks.
You can start a standalone master server by executing:
./sbin/start-master.sh
It can be launched anywhere.
To run an application on the Spark cluster:
./bin/spark-shell --master spark://IP:PORT
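A worker then has to be started on each node and pointed at that master URL. A minimal sketch (in Spark 3.1+ the script is start-worker.sh; older releases call it start-slave.sh):
./sbin/start-worker.sh spark://IP:PORT
Each node that runs this registers its Worker with the master at spark://IP:PORT.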
Is it possible for the Master and the Driver to be on the same node? I presume there should be a rule somewhere stating that these two nodes should be different?
In standalone mode, when you start your machines, certain JVMs start: the Spark Master JVM starts, a Worker JVM starts on each machine, and the Workers register with the Spark Master. Together they act as the resource manager. When you start your application, or submit it in cluster mode, the driver starts on whichever machine you ssh into to launch that application. The Driver JVM then contacts the Spark Master for executors (Ex), and in standalone mode the Worker launches the Ex. So the Spark Master is per cluster, and the Driver JVM is per application.
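As a sketch of the "Driver JVM per application" point, each of the following submissions starts its own driver, and both drivers ask the same Spark Master for executors (the class names and jars are placeholders):
./bin/spark-submit --master spark://IP:PORT --class com.example.AppOne app-one.jar
./bin/spark-submit --master spark://IP:PORT --class com.example.AppTwo app-two.jar
One master per cluster, one driver per submitted application.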
In the event of a Driver node failure, who is responsible for relaunching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager, and Worker nodes be involved (if they are), and in what order?
If an Ex (executor) JVM crashes, the Worker JVM restarts it, and when a Worker JVM crashes, the Spark Master restarts it. With a Spark standalone cluster in cluster deploy mode, you can also pass --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code; the Spark Master will then relaunch the Driver JVM.
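For example, a submission in cluster deploy mode with supervision could look like this (the class and jar are placeholders):
./bin/spark-submit --master spark://IP:PORT --deploy-mode cluster --supervise --class com.example.MyApp myapp.jar
Here the Master launches the Driver JVM on one of the Workers and relaunches it if it exits with a non-zero code.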
Similarly to the previous question: if the Master node fails, what happens exactly, and who is responsible for recovering from the failure?
A failure of the master means the executors can no longer communicate with it, so they will stop working. It also means the driver cannot contact the master for job status, so your application will fail. The loss of the master will be acknowledged by the running applications, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:
1. The application won't be able to finish gracefully.
2. If the Spark Master is down, the Worker will try to reregisterWithMaster. If this fails repeatedly, the workers will simply give up.
reregisterWithMaster() - re-registers with the active master this worker has been communicating with. If there is none, it means this worker is still bootstrapping and hasn't established a connection with a master yet, in which case it should re-register with all masters. It is important to re-register only with the active master during failures; if the worker unconditionally attempts to re-register with all masters, a race condition may occur. The error is described in detail in SPARK-4592:
At this point, long-running applications won't be able to continue processing, but this still shouldn't result in immediate failure. Instead, the application will wait for the master to come back online (filesystem recovery mode) or for contact from a new leader (ZooKeeper mode), and if that happens, it will continue processing.
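For reference, the recovery modes mentioned above are configured on the master, typically via SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh (the ZooKeeper quorum and directory below are placeholders):
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 -Dspark.deploy.zookeeper.dir=/spark"
For single-master filesystem recovery, set -Dspark.deploy.recoveryMode=FILESYSTEM and -Dspark.deploy.recoveryDirectory=/some/shared/dir instead.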