Your understanding is close to reality, but it seems you are a little confused on some points. Let's see if I can make this clearer.
Let's take the word count example in Scala:
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val inputFile = args(0)
        val outputFile = args(1)
        val conf = new SparkConf().setAppName("wordCount")
        val sc = new SparkContext(conf)
        val input = sc.textFile(inputFile)
        val words = input.flatMap(line => line.split(" "))
        val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
        counts.saveAsTextFile(outputFile)
      }
    }
In every Spark application you have an initialisation step where you create a SparkContext object, providing some configuration such as the app name and the master; then you read an inputFile, you process it, and you save the result of the processing to disk. All of this code runs in the driver, except for the anonymous functions that do the actual processing (the functions passed to .flatMap, .map and .reduceByKey) and the I/O functions textFile and saveAsTextFile, which run remotely on the cluster.
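To make that split a bit more concrete, here is a small sketch (the object name WhereDoesItRun is made up, not part of your code): everything outside the closures executes in the driver JVM, while the body of the closure passed to .map is serialized, shipped to the executors, and runs there.

    import org.apache.spark.{SparkConf, SparkContext}

    object WhereDoesItRun {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("whereDoesItRun"))

        // This println executes in the driver JVM.
        println("Hello from the driver")

        val data = sc.parallelize(1 to 4)

        // The body of this closure is executed by the executors on the worker
        // nodes, once per element of each partition.
        val doubled = data.map { x =>
          println(s"Hello from an executor, processing $x") // shows up in the executor logs, not on your console
          x * 2
        }

        // collect() brings the results back over the network to the driver.
        println(doubled.collect().mkString(", "))
        sc.stop()
      }
    }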
Here, DRIVER is the name given to the part of the program that runs locally on the same node where you submit your code with spark-submit (in your picture it is called the Client Node). You can submit your code from any machine (either the ClientNode, a WorkerNode or even the MasterNode), as long as it has the Spark binaries and network access to your YARN cluster. For simplicity, I will assume the client node is your laptop and the YARN cluster is made of remote machines.
For simplicity, I will leave Zookeeper out of this picture, since it is used to provide high availability for HDFS and it is not involved in running a Spark application. I have to mention that the YARN ResourceManager and the HDFS NameNode are roles in YARN and HDFS (they are actually processes running inside a JVM) and they could live on the same master node or on separate machines. Likewise, the YARN NodeManagers and the HDFS DataNodes are only roles, but they usually live on the same machine to provide data locality (processing close to where the data is stored).
When you submit your application, you first contact the ResourceManager, which together with the NameNode tries to find worker nodes available to run your Spark tasks. In order to take advantage of the data locality principle, the ResourceManager will prefer worker nodes that store on the same machine the HDFS blocks (any of the 3 replicas of each block) of the file you have to process. If no worker nodes with those blocks are available, it will use any other worker node. In that case, since the data will not be available locally, the HDFS blocks have to be moved over the network from any of the DataNodes to the NodeManager running the Spark task. This process is done for every block of your file, so some blocks may be found locally and some have to be moved.
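As a side note, you can see this block/task relation from the driver: by default sc.textFile creates one partition (and therefore one task) per HDFS block of the file, and each task is scheduled as close to its block as possible. A minimal sketch, where hdfs:///data/bigfile.txt is just a made-up path:

    import org.apache.spark.{SparkConf, SparkContext}

    object BlocksAndPartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("blocksAndPartitions"))

        // With a 128 MB block size, a 1 GB file is split into 8 HDFS blocks,
        // so textFile gives you 8 partitions, i.e. 8 tasks to schedule.
        val input = sc.textFile("hdfs:///data/bigfile.txt")
        println(s"Number of partitions (~ number of HDFS blocks): ${input.partitions.length}")

        sc.stop()
      }
    }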
When the ResourceManager finds a worker node available, it will contact the NodeManager on that node and ask it to create a YARN container (a JVM) where a Spark executor will run. In other cluster modes (Mesos or Standalone) you don't have a YARN container, but the concept of a Spark executor is the same. A Spark executor runs as a JVM and can run multiple tasks.
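How many executor JVMs you get and how many tasks each one can run in parallel is driven by configuration. A minimal sketch using the standard spark.executor.* settings (the values are arbitrary examples, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    object ExecutorSizing {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("wordCount")
          .set("spark.executor.instances", "4") // ask YARN for 4 executor containers
          .set("spark.executor.cores", "2")     // each executor can run 2 tasks concurrently
          .set("spark.executor.memory", "2g")   // heap size of each executor JVM

        val sc = new SparkContext(conf)
        // ... your job here ...
        sc.stop()
      }
    }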
The driver running on the client node and the tasks running on the Spark executors keep communicating in order to run your job. If the driver is running on your laptop and your laptop crashes, you will lose the connection to the tasks and your job will fail. That is why, when Spark is running on a YARN cluster, you can specify whether you want to run your driver on your laptop ("--deploy-mode=client") or on the YARN cluster as yet another YARN container ("--deploy-mode=cluster"). For more details see spark-submit.
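For example, submitting the word count job above in cluster mode could look roughly like this (the jar name and the HDFS paths are just placeholders):

    spark-submit \
      --class WordCount \
      --master yarn \
      --deploy-mode cluster \
      wordcount.jar \
      hdfs:///input/file.txt hdfs:///output/counts

With --deploy-mode client instead, the driver stays on the machine where you ran spark-submit, and only the executors run inside YARN containers.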