Does Apache Spark scan and process data at the same time, or does it first read the entire file into memory and only then start transforming it?

I am curious whether Spark first reads the entire file into memory and only then starts processing it (that is, applying transformations and actions), or whether it reads the first chunk of the file, applies the transformation to it, then reads the next chunk, and so on.

Is there a difference between Spark and Hadoop in this respect? I have read that Spark keeps the entire file in memory most of the time, while Hadoop does not. But what about the initial step, when we read the file for the first time and map the keys?

thanks

+5
2 answers

I think a fair description would be:

Both Hadoop (or rather MapReduce) and Spark start from the same underlying HDFS file system.

In the map phase, both will read all the data and actually write the map output to disk so that it can be sorted and distributed between nodes by the shuffle logic. Both of them actually try to cache the just-mapped data in memory, in addition to spilling it to disk, so that the shuffle can do its job. The difference is that Spark is much more efficient in this process, trying to optimally match the node selected for a particular computation with the data already cached on that node. Because Spark also does something called lazy evaluation, its memory usage ends up very different from Hadoop's, as a consequence of scheduling and caching at the same time.
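To make the lazy evaluation part concrete, here is a minimal sketch in Scala; the SparkContext `sc` and the log path are assumptions of mine, not something from the answer above:

```scala
// Declaring transformations reads nothing: each call only records a lineage step.
val lines  = sc.textFile("hdfs:///logs/app.log")   // no I/O yet
val errors = lines.filter(_.contains("ERROR"))     // still no I/O

// Only an action triggers execution, and Spark reads only as much of the file
// as the action needs, split by split, streaming it through the filter.
val firstFew = errors.take(5)
```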

In a word count job, Hadoop does roughly the following (a code sketch follows the list):

  • Map every word to 1.
  • Write all of these mapped (word, 1) pairs to a single file in HDFS (a single file can still span multiple nodes in the distributed HDFS) (this is the shuffle phase).
  • Sort the (word, 1) rows in this shared file (this is the sort phase).
  • Have the reducers read sections (partitions) of this shared file, which now contains all the words sorted, and sum up all those 1s for each word.
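For illustration, here is a minimal sketch of the two callbacks behind these steps, written in Scala against the Hadoop MapReduce Java API; the class names and the whitespace tokenizer are my own, and the job driver setup is omitted:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  // Step 1: emit (word, 1) for every word; the framework writes these pairs
  // out, then shuffles and sorts them (steps 2 and 3) before the reducers run.
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  // Step 4: all values for one word arrive together; sum up the 1s.
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}
```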

Spark, on the other hand, works backwards (see the sketch after this list):

  • It figures that, just as in Hadoop, it is probably most efficient to have all those words summed up via separate reducer runs, so it decides for some reason that it wants to split the job into x parts and then merge them into the final result.
  • It therefore knows that the words will have to be sorted, which will require at least part of them in memory at any given time.
  • It then figures that such a sorted list will require all of the words as (word, 1) pairs.
  • It works out steps 3, then 2, then 1.
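The same word count expressed against Spark's RDD API might look like the sketch below; the paths and the `sc` SparkContext are assumptions. Note that nothing is read until the final action:

```scala
val counts = sc.textFile("hdfs:///data/corpus.txt")   // nothing is read yet
  .flatMap(_.split("\\s+"))                           // split lines into words
  .map(word => (word, 1))                             // the (word, 1) pairs
  .reduceByKey(_ + _)                                 // shuffle + sum per word

// Only this action makes Spark plan the job (backwards, from the result it
// needs) and start reading the file, partition by partition.
counts.saveAsTextFile("hdfs:///data/wordcounts")
```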

Now the trick relative to Hadoop is that Spark knows that in step 3 it caches the items it will need in step 2, and in step 2 it already knows how those parts (mostly KV pairs) will be needed in the final step 1. This allows Spark to plan the execution of jobs very efficiently, by caching data it knows will be needed in later stages of the work. Hadoop, working from the very beginning (mapping) towards the end without explicitly looking ahead to the following steps, simply cannot use memory this effectively and hence does not waste resources keeping in memory the large chunks that Spark would retain. Unlike Spark, it simply does not know whether all the pairs produced in the map phase will be needed in the next step.

The fact that Spark keeps the entire dataset in memory is therefore not something Spark actively does, but rather a consequence of how Spark is able to plan the execution of a job. On the other hand, Spark may actually keep less in memory in a different kind of job. Counting the number of distinct words is a good example in my opinion. Here Spark, having planned ahead, can immediately discard a repeated word from the cache/memory as soon as it encounters it during mapping, whereas Hadoop would go ahead and waste memory shuffling the repeated words too (I acknowledge there are a million ways to make Hadoop do this as well, but they are not out of the box; there are also ways of writing your Spark job in unfortunate ways that break these optimizations, but it is not that easy to fool Spark here :)).
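A sketch of that distinct-words example (again with a hypothetical path): `distinct` and `reduceByKey` both combine values map-side, so a word that repeats within a partition is collapsed before the shuffle rather than being kept around:

```scala
val words = sc.textFile("hdfs:///data/corpus.txt").flatMap(_.split("\\s+"))

// Counting occurrences has to shuffle one combined (word, count) pair per word ...
val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _)

// ... while counting distinct words lets Spark drop duplicates during the map
// phase: repeated words within a partition never survive to the shuffle.
val distinctWords = words.distinct().count()
```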

I hope this helps you see that memory usage is simply a natural consequence of the way Spark works, not something it actively aims for, and also not something strictly required by Spark. It is perfectly capable of repeatedly spilling data back to disk between steps of the execution when memory becomes an issue.
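As an aside, even explicit caching is only a hint. A small sketch reusing the hypothetical `words` RDD from above shows the storage level that allows spilling to disk:

```scala
import org.apache.spark.storage.StorageLevel

// Ask Spark to keep the partitions in memory, but allow them to be written to
// disk when memory is tight; without a persist call nothing is retained at all.
val cachedWords = words.persist(StorageLevel.MEMORY_AND_DISK)
cachedWords.count()   // the first action materializes (and caches) the partitions
```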

For a deeper understanding of this, I recommend learning about Spark's DAG scheduler from here to see how this is actually done in code. You will see that it always follows the pattern of working out where data will be cached before figuring out what to compute where.
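Short of reading the scheduler code, one quick way to see what was planned is the lineage dump available on any RDD (shown here on the `counts` RDD from the earlier sketch):

```scala
// toDebugString prints the RDD lineage with its shuffle boundaries, i.e. the
// stages the DAG scheduler will split the job into.
println(counts.toDebugString)
```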

+2

Spark uses lazy iterators to process the data and, if necessary, can spill data to disk. It does not read all the data into memory.

The difference compared to Hadoop is that Spark can chain (pipeline) multiple operations together.
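A small sketch of that chaining (hypothetical path, Scala RDD API): the narrow transformations below are pipelined into a single stage, so each partition is processed in one lazy iterator pass with no intermediate dataset materialized:

```scala
val total = sc.textFile("hdfs:///data/corpus.txt")
  .map(_.toLowerCase)      // narrow transformation
  .filter(_.nonEmpty)      // fused into the same stage
  .map(_.length)           // still the same iterator pass
  .sum()                   // the action that actually drives the iterators
```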

+1

Source: https://habr.com/ru/post/1261650/
