Why does Spark shuffle store intermediate data on disk?

Why is intermediate data stored on disk during a shuffle? I am trying to understand why it cannot be kept in memory. What are the problems with writing to memory?

Is any work being done on writing to memory?

1 answer

Spark stores the intermediate data from a shuffle operation on disk as part of its under-the-hood optimization. When Spark has to recompute part of an RDD graph, it may be able to truncate the lineage of that graph if the RDD already exists as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted.

The source of this answer is the O'Reilly book Learning Spark by Karau, Konwinski, Wendell, and Zaharia. Chapter 8: Tuning and Debugging Spark. Section: Components of Execution: Jobs, Tasks, and Stages.
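
To make the lineage truncation described above concrete, here is a small toy model in plain Python. It is not Spark code and does not use Spark's API: the class `ToyShuffle`, the function `expensive_parent`, and the file name `shuffle_0.pkl` are all hypothetical names made up for this sketch. It only assumes that a shuffle writes its grouped output to a file and that a later computation can read that file instead of re-running the parent stage.

```python
import os
import pickle
import tempfile
from collections import defaultdict


class ToyShuffle:
    """Toy stand-in for a shuffle stage (illustration only, not Spark)."""

    def __init__(self, upstream, workdir):
        self.upstream = upstream      # callable that yields (key, value) pairs
        self.path = os.path.join(workdir, "shuffle_0.pkl")  # hypothetical name
        self.upstream_runs = 0        # how many times the parent stage ran

    def _write_shuffle_file(self):
        # "Map side": run the parent stage and group its records by key,
        # then write the grouped records to disk.
        self.upstream_runs += 1
        grouped = defaultdict(list)
        for key, value in self.upstream():
            grouped[key].append(value)
        with open(self.path, "wb") as f:
            pickle.dump(dict(grouped), f)

    def reduce(self):
        # "Reduce side": if the shuffle file is already on disk, read it
        # and skip the parent stage entirely. This is the point where the
        # lineage is cut off at the shuffle boundary.
        if not os.path.exists(self.path):
            self._write_shuffle_file()
        with open(self.path, "rb") as f:
            grouped = pickle.load(f)
        return {key: sum(values) for key, values in grouped.items()}


def expensive_parent():
    # Hypothetical upstream computation that we would like to avoid repeating.
    return [("a", 1), ("b", 2), ("a", 3)]


with tempfile.TemporaryDirectory() as workdir:
    stage = ToyShuffle(expensive_parent, workdir)
    first = stage.reduce()    # runs the parent stage and writes the file
    second = stage.reduce()   # reuses the file, parent stage is skipped
    runs = stage.upstream_runs
```

In this sketch `reduce()` is called twice, but the parent stage runs only once, because the second call finds the shuffle output already on disk. Nothing was cached explicitly; the reuse is purely a side effect of the earlier shuffle, which is the behaviour the answer describes.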


Source: https://habr.com/ru/post/979145/

