Does Spark write intermediate shuffle outputs to disk?

I am reading Learning Spark, and I do not understand what it means that the outputs of a Spark shuffle are written to disk. See Chapter 8, "Tuning and Debugging Spark", pages 148-149:

Spark's internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.

As I understand it, there are various persistence (storage) levels, for example the default MEMORY_ONLY, which imply that intermediate results are never saved to disk.

When and why is shuffle output saved to disk? How can it be reused by later computations?

1 answer

When

This occurs the first time an operation that requires a shuffle is evaluated (by an action), and it cannot be disabled.

Why

This is an optimization. Shuffling is one of the most expensive operations in Spark, so its output is kept on disk rather than recomputed.

How can this be reused by later computations?

It is automatically reused by any subsequent action performed on the same RDD.


Source: https://habr.com/ru/post/1662732/

