I'm trying to find complete documentation on the internal architecture of Apache Spark, but my searches turn up nothing.
For example, I'm trying to understand the following: suppose we have a 1 TB text file on HDFS (3 nodes in the cluster, replication factor = 1). The file will be split into 128 MB blocks, and each block will be stored on only one node. We run Spark Workers on these nodes. I know that Spark tries to work with data stored in HDFS on the same node (to avoid network I/O). For example, I want to count the words in this 1 TB text file.
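For concreteness, here is a minimal sketch of the kind of word-count job I have in mind (the HDFS paths are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    // Read the 1 TB text file; Spark creates roughly one partition per 128 MB HDFS block
    val lines = sc.textFile("hdfs:///data/big.txt")

    // Classic word count: split lines into words, emit (word, 1), sum the counts per word
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/wordcount-output")
    spark.stop()
  }
}
```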
Here I have the following questions:
- Will Spark load each block (128 MB) into RAM, count the words in it, drop it from memory, and then move on to the next block sequentially? What happens if there is not enough free RAM?
- When does Spark fall back to reading non-local data from HDFS?
- What if I need to perform a more complex task, where the results of each iteration on each Worker have to be shared with all other Workers (is that shuffling?)? Do I have to write them to HDFS myself and then read them back? For example, I cannot understand how K-means clustering or gradient descent works on Spark (see the sketch after this list for the kind of iterative job I mean).
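This is roughly the iterative pattern I am asking about, written as a gradient-descent-style sketch. The input path, feature dimension, learning rate, and the squared-loss gradient are all made-up placeholders; the point is only how results from one iteration reach the workers in the next:

```scala
import org.apache.spark.sql.SparkSession

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IterativeSketch").getOrCreate()
    val sc = spark.sparkContext

    // Each record: (label, feature vector), parsed from a CSV-like line
    val data = sc.textFile("hdfs:///data/points.txt")
      .map { line =>
        val parts = line.split(",").map(_.toDouble)
        (parts.head, parts.tail)
      }
      .cache() // keep partitions across iterations instead of re-reading HDFS

    val dims = 10 // assumed feature dimension
    var weights = Array.fill(dims)(0.0)

    for (_ <- 1 to 20) {
      // Broadcast the current weights to every worker, compute partial gradients
      // locally on each partition, and aggregate them back on the driver --
      // this is the "share results of each iteration" step I am asking about.
      val w = sc.broadcast(weights)
      val gradient = data.map { case (label, features) =>
        val prediction = features.zip(w.value).map { case (x, wi) => x * wi }.sum
        features.map(_ * (prediction - label)) // squared-loss gradient for one point
      }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

      weights = weights.zip(gradient).map { case (wi, g) => wi - 0.01 * g }
    }

    spark.stop()
  }
}
```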
I would appreciate any link to an Apache Spark architecture guide.