Your understanding is close to reality, but it seems you are a little confused on some points. Let's see if I can make this clearer.
Let's take the word count example in Scala:
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val inputFile = args(0)
        val outputFile = args(1)
        val conf = new SparkConf().setAppName("wordCount")
        val sc = new SparkContext(conf)
        val input = sc.textFile(inputFile)
        val words = input.flatMap(line => line.split(" "))
        val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
        counts.saveAsTextFile(outputFile)
      }
    }
In every Spark application you have an initialisation step where you create a SparkContext object, providing some configuration such as the app name and the master; then you read an inputFile, you process it, and you save the result of the processing to disk. All of this code runs in the driver, except for the anonymous functions that do the actual processing (the functions passed to .flatMap, .map and .reduceByKey) and the I/O functions textFile and saveAsTextFile, which run remotely on the cluster.
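To make that split a bit more concrete, here is a small sketch (the object name WhereDoesItRun is made up, not part of your code): everything outside the closures executes in the driver JVM, while the body of the closure passed to .map is serialized, shipped to the executors, and runs there.

    import org.apache.spark.{SparkConf, SparkContext}

    object WhereDoesItRun {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("whereDoesItRun"))

        // This println executes in the driver JVM.
        println("Hello from the driver")

        val data = sc.parallelize(1 to 4)

        // The body of this closure is executed by the executors on the worker
        // nodes, once per element of each partition.
        val doubled = data.map { x =>
          println(s"Hello from an executor, processing $x") // shows up in the executor logs, not on your console
          x * 2
        }

        // collect() brings the results back over the network to the driver.
        println(doubled.collect().mkString(", "))
        sc.stop()
      }
    }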
Here, DRIVER is the name given to the part of the program that runs locally on the same node where you submit your code with spark-submit (in your picture it is called the Client Node). You can submit your code from any machine (either the ClientNode, a WorkerNode or even the MasterNode), as long as it has the Spark binaries and network access to your YARN cluster. For simplicity, I will assume the client node is your laptop and the YARN cluster is made of remote machines.
For simplicity, I will leave Zookeeper out of this picture, since it is used to provide high availability for HDFS and it is not involved in running a Spark application. I have to mention that the YARN ResourceManager and the HDFS NameNode are roles in YARN and HDFS (they are actually processes running inside a JVM) and they could live on the same master node or on separate machines. Likewise, the YARN NodeManagers and the HDFS DataNodes are only roles, but they usually live on the same machine to provide data locality (processing close to where the data is stored).
When you submit your application, you first contact the ResourceManager, which together with the NameNode tries to find worker nodes available to run your Spark tasks. In order to take advantage of the data locality principle, the ResourceManager will prefer worker nodes that store on the same machine the HDFS blocks (any of the 3 replicas of each block) of the file you have to process. If no worker nodes with those blocks are available, it will use any other worker node. In that case, since the data will not be available locally, the HDFS blocks have to be moved over the network from any of the DataNodes to the NodeManager running the Spark task. This process is done for every block of your file, so some blocks may be found locally and some have to be moved.
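As a side note, you can see this block/task relation from the driver: by default sc.textFile creates one partition (and therefore one task) per HDFS block of the file, and each task is scheduled as close to its block as possible. A minimal sketch, where hdfs:///data/bigfile.txt is just a made-up path:

    import org.apache.spark.{SparkConf, SparkContext}

    object BlocksAndPartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("blocksAndPartitions"))

        // With a 128 MB block size, a 1 GB file is split into 8 HDFS blocks,
        // so textFile gives you 8 partitions, i.e. 8 tasks to schedule.
        val input = sc.textFile("hdfs:///data/bigfile.txt")
        println(s"Number of partitions (~ number of HDFS blocks): ${input.partitions.length}")

        sc.stop()
      }
    }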
When the ResourceManager finds a worker node available, it will contact the NodeManager on that node and ask it to create a YARN container (a JVM) where a Spark executor will run. In other cluster modes (Mesos or Standalone) you don't have a YARN container, but the concept of a Spark executor is the same. A Spark executor runs as a JVM and can run multiple tasks.
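How many executor JVMs you get and how many tasks each one can run in parallel is driven by configuration. A minimal sketch using the standard spark.executor.* settings (the values are arbitrary examples, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    object ExecutorSizing {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("wordCount")
          .set("spark.executor.instances", "4") // ask YARN for 4 executor containers
          .set("spark.executor.cores", "2")     // each executor can run 2 tasks concurrently
          .set("spark.executor.memory", "2g")   // heap size of each executor JVM

        val sc = new SparkContext(conf)
        // ... your job here ...
        sc.stop()
      }
    }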
The driver running on the client node and the tasks running on the Spark executors keep communicating in order to run your job. If the driver is running on your laptop and your laptop crashes, you will lose the connection to the tasks and your job will fail. That is why, when Spark is running on a YARN cluster, you can specify whether you want to run your driver on your laptop ("--deploy-mode=client") or on the YARN cluster as yet another YARN container ("--deploy-mode=cluster"). For more details see spark-submit.
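For example, submitting the word count job above in cluster mode could look roughly like this (the jar name and the HDFS paths are just placeholders):

    spark-submit \
      --class WordCount \
      --master yarn \
      --deploy-mode cluster \
      wordcount.jar \
      hdfs:///input/file.txt hdfs:///output/counts

With --deploy-mode client instead, the driver stays on the machine where you ran spark-submit, and only the executors run inside YARN containers.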