Apache Spark Architecture

I am trying to find complete documentation on the internal architecture of Apache Spark, but my searches turn up nothing.

For example, I'm trying to understand the following. Suppose we have a 1 TB text file on HDFS (3 nodes in the cluster, replication factor 1). The file will be split into 128 MB blocks, and each block will be stored on only one node. We run Spark workers on these same nodes. I know that Spark tries to work with data stored in HDFS on the same node (to avoid network I/O). Say I want to count the words in this 1 TB text file.
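For concreteness, here is a sketch of that word count in Spark's Scala API. The HDFS paths are placeholders, and the job is assumed to be launched through spark-submit, which supplies the master URL:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    // Each 128 MB HDFS block becomes (roughly) one partition, and Spark
    // prefers to schedule each partition's task on a node holding that block.
    val counts = sc.textFile("hdfs:///data/big.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                              // this step shuffles

    counts.saveAsTextFile("hdfs:///data/big-counts")   // placeholder path
    sc.stop()
  }
}
```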

Here I have the following questions:

  • Will Spark load one block (128 MB) into RAM, count the words in it, drop it from memory, and process the blocks sequentially this way? What happens if there is no free RAM?
  • When will Spark fall back to using non-local data from HDFS?
  • What if I need to perform a more complex task, where the results of each iteration on each worker need to be transferred to all the other workers (a shuffle?). Do I have to write them to HDFS myself and then read them back? For example, I can't understand how K-means clustering or gradient descent works on Spark.

I would appreciate any link to an Apache Spark architecture guide.

2 answers

Adding to the other answer, I would like to include a diagram of the Spark core architecture, since the question asks about it.

The master is the entry point: the driver registers an application with it, and it allocates executors on the worker nodes for that application.

[Diagram: Spark core architecture]
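As a small illustration of that entry point, here is a hedged sketch of a driver registering with a standalone master through a SparkContext; the master then launches executors for the application on the workers. The host name below is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Creating the SparkContext registers this application with the master,
// which allocates executors on the worker nodes on its behalf.
val conf = new SparkConf()
  .setAppName("ArchitectureDemo")
  .setMaster("spark://master-host:7077") // placeholder standalone master URL

val sc = new SparkContext(conf)
println(s"Registered as application ${sc.applicationId}")
sc.stop()
```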


Here are the answers to your questions.

  • Spark will try to load one 128 MB block into memory and process it in RAM. Keep in mind that the in-memory size can be several times larger than the raw file size because of JVM overhead (object headers and the like); in my experience it is often 2-4x larger. If there is not enough RAM, Spark spills data to local disk. You can tune these two parameters to minimize spilling: spark.shuffle.memoryFraction and spark.storage.memoryFraction (see the configuration sketch after this list).

  • Spark will always try to use data that is local on HDFS. If a block is not available locally, it will fetch it from another node in the cluster.

  • When shuffling, you do not need to save intermediate results to HDFS manually. Spark writes shuffle output to local disk and transfers only the data each node actually needs, reusing local storage for the next stage as much as possible. Iterative algorithms such as K-means and gradient descent work the same way; a sketch follows this list.
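On the first point, here is a minimal sketch of how those two settings could be applied. One caveat: spark.shuffle.memoryFraction and spark.storage.memoryFraction are legacy settings from Spark 1.5 and earlier; Spark 1.6 replaced them with unified memory management (spark.memory.fraction), under which they only take effect if spark.memory.useLegacyMode is enabled.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Legacy memory knobs (Spark <= 1.5). On Spark 1.6+ these apply only
// with spark.memory.useLegacyMode=true; the unified replacements are
// spark.memory.fraction and spark.memory.storageFraction.
val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.shuffle.memoryFraction", "0.4") // heap share for shuffle buffers
  .set("spark.storage.memoryFraction", "0.4") // heap share for cached blocks

val sc = new SparkContext(conf)
```

On the third point, here is a hedged sketch of gradient descent (logistic regression) over an RDD, showing why no manual HDFS round-trips are needed between iterations. The Point class, the train function, and all numeric choices are hypothetical, not part of the Spark API:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical record: a feature vector plus a 0/1 label.
case class Point(features: Array[Double], label: Double)

def train(data: RDD[Point], iterations: Int, lr: Double): Array[Double] = {
  val n = data.count()
  val dim = data.first().features.length
  var weights = Array.fill(dim)(0.0)

  for (_ <- 1 to iterations) {
    // Ship the current (small) weight vector to every executor.
    val bcW = data.sparkContext.broadcast(weights)

    // Each task computes per-record gradients over its local partition;
    // reduce combines them pairwise, so only dim-sized vectors cross the
    // network -- the large input itself never leaves its nodes.
    val gradient = data.map { p =>
      val w = bcW.value
      val margin = w.indices.map(i => w(i) * p.features(i)).sum
      val scale = 1.0 / (1.0 + math.exp(-margin)) - p.label
      p.features.map(_ * scale)
    }.reduce((a, b) => Array.tabulate(a.length)(i => a(i) + b(i)))

    weights = Array.tabulate(dim)(i => weights(i) - lr * gradient(i) / n)
    bcW.unpersist()
  }
  weights
}
```

MLlib's K-means and gradient-based optimizers follow the same pattern: small model state moves between the driver and the executors each iteration, while the large input RDD stays partitioned in place (and is typically cached with .cache() so HDFS is not re-read on every pass).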

Here is a good video that covers the Spark architecture in detail, what happens during a shuffle, and performance tuning tips.


Source: https://habr.com/ru/post/988909/

