Why does Spark shuffle store intermediate data on disk?

Question


Why is intermediate data stored on disk during a shuffle? I am trying to understand why it cannot be kept in memory. What are the problems with writing to memory?

Is any work being done on writing to memory?


shuffle apache-spark

Venkat ankam Dec 04 '14 at 21:13


1 answer

rainman · Answer 1 · 2015-03-17T04:04:16+0000

Spark writes the intermediate data of a shuffle operation to disk as part of an under-the-hood optimization. When Spark needs to recompute part of an RDD graph, it can truncate the lineage of that graph if an RDD is already materialized as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted.

The source of this answer is the O'Reilly book Learning Spark by Karau, Konwinski, Wendell, and Zaharia. Chapter 8: Tuning and Debugging Spark. Section: Components of Execution: Jobs, Tasks, and Stages.
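To make the lineage truncation described in this answer concrete, here is a minimal toy model in plain Python. It is not Spark's API or implementation: `ToyRDD`, `_shuffle_input`, and the helper functions are hypothetical names invented for illustration. The sketch assumes only the behavior stated above, namely that the data written for a shuffle stays on disk, so a later recomputation can start from that file instead of re-running everything upstream of the shuffle.

```python
import os
import pickle
import tempfile


class ToyRDD:
    """Hypothetical stand-in for an RDD: a compute function plus its parent (lineage)."""

    def __init__(self, compute, parent=None, is_shuffle=False):
        self.compute = compute
        self.parent = parent
        self.is_shuffle = is_shuffle
        self.shuffle_file = None  # path of the shuffle data on disk, once written

    def _shuffle_input(self, log):
        # Lineage truncation: if the shuffle data is already on disk,
        # read it back instead of recomputing the upstream (map-side) steps.
        if self.shuffle_file is not None and os.path.exists(self.shuffle_file):
            log.append("skipped map stage")
            with open(self.shuffle_file, "rb") as f:
                return pickle.load(f)
        data = self.parent.collect(log)
        fd, self.shuffle_file = tempfile.mkstemp(suffix=".shuffle")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(data, f)  # the "shuffle write" of this toy model
        return data

    def collect(self, log):
        if self.is_shuffle:
            parent_data = self._shuffle_input(log)
        else:
            parent_data = self.parent.collect(log) if self.parent else None
        log.append(self.compute.__name__)  # record which step actually ran
        return self.compute(parent_data)


def load_words(_):
    return ["a", "b", "a", "c", "b", "a"]


def pair_up(words):
    return [(w, 1) for w in words]


def reduce_by_key(pairs):
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())


def keep_frequent(counts):
    return [(key, n) for key, n in counts if n > 1]


source = ToyRDD(load_words)
pairs = ToyRDD(pair_up, parent=source)
counts = ToyRDD(reduce_by_key, parent=pairs, is_shuffle=True)  # shuffle boundary
frequent = ToyRDD(keep_frequent, parent=counts)

first_log, second_log = [], []
first_result = frequent.collect(first_log)    # runs every step, writes shuffle data
second_result = frequent.collect(second_log)  # reuses the shuffle data on disk

print(first_log)
print(second_log)

os.remove(counts.shuffle_file)  # clean up the temporary file
```

On the second call nothing upstream of the shuffle boundary is executed, even though no `ToyRDD` was ever cached. In real Spark the comparable effect shows up in the web UI as stages marked "skipped"; the mechanics there (per-partition shuffle files, executors, cleanup) are far more involved than this sketch.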
