Spark: clearing shuffle spills from disk

I have a loop that generates some RDDs, repartitions them, and then runs an aggregateByKey operation. Each iteration computes a final RDD, which is cached and checkpointed and is also used as the input RDD for the next iteration.

These RDDs are quite large and generate many intermediate shuffle blocks before arriving at the final RDD of each iteration. I compress my shuffles and let them spill to disk.

I notice on my worker machines that the local working directory where shuffle files are written is not being cleaned up, so eventually I run out of disk space. I was under the impression that checkpointing my RDD would remove all the intermediate shuffle blocks, but that does not seem to be happening. Does anyone have an idea how I can clear the shuffle blocks after each iteration of the loop, or why they are not being cleaned up?
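For reference, the loop is shaped roughly like this (a minimal sketch; `step`, the key/value types, the partition count, and the other names are placeholders, not the actual job):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Placeholder for the real repartition + aggregateByKey step.
def step(input: RDD[(String, Long)]): RDD[(String, Long)] =
  input
    .repartition(200)
    .aggregateByKey(0L)(_ + _, _ + _)

def iterate(initial: RDD[(String, Long)], iterations: Int): RDD[(String, Long)] = {
  var current = initial
  for (_ <- 1 to iterations) {
    val next = step(current).persist(StorageLevel.MEMORY_AND_DISK)
    next.checkpoint()   // requires sc.setCheckpointDir(...) to have been set
    next.count()        // force materialization of this iteration's result
    current = next      // the result becomes the input of the next iteration
  }
  current
}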

+4
2 answers

Once you persist an RDD to memory/disk, it stays in memory/disk for as long as the Spark context is alive.

To remove an RDD from memory/disk, call unpersist().

From the javadoc:

/**
 * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
 *
 * @param blocking Whether to block until all blocks are deleted.
 * @return This RDD.
 */
def unpersist(blocking: Boolean = true)

So, in your case you can simply call:

rdd.unpersist()
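In an iterative job like the one in the question, that means keeping a handle on the previous iteration's RDD and dropping it once the new result has been materialized. A minimal sketch (the names advance/previous/next are illustrative, not part of the Spark API):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Materialize the new RDD, then release the old one so its blocks
// no longer occupy memory/disk.
def advance[T](previous: RDD[T], next: RDD[T]): RDD[T] = {
  val persisted = next.persist(StorageLevel.MEMORY_AND_DISK)
  persisted.count()                    // compute before releasing the old data
  previous.unpersist(blocking = true)  // remove the previous iteration's blocks
  persisted
}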

+1

Keep in mind that every transformation creates a new RDD that references its parent. Consider:

val rdd2 = rdd1.<transformation>
val rdd3 = rdd2.<transformation>
...

As long as you keep references to the earlier RDDs, their shuffle files stay on disk (the shuffle files of an rdd are removed only after that rdd has been gc'ed, not before).
Instead of a plain persist() on the final RDD, you can also call localCheckpoint(), which truncates the lineage so the earlier RDDs can be collected. For example, like this:

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
  .localCheckpoint()
// do something here, and later
rdd.unpersist()

Note that localCheckpoint() sacrifices fault tolerance: once the lineage is truncated, the RDD can no longer be recomputed if its cached data is lost.
If you want to read more about how checkpointing works, see: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-checkpointing.html
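Putting the two answers together for the looping case in the question, one iteration could look roughly like this (a sketch under the assumption that the previous iteration's RDD is no longer needed once the new one is materialized; step is a placeholder for the real transformations):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// One iteration: cut the lineage of the new RDD with localCheckpoint(),
// then drop the previous RDD so that, once it is garbage collected, the
// ContextCleaner can remove its shuffle files.
def nextIteration[T](previous: RDD[T], step: RDD[T] => RDD[T]): RDD[T] = {
  val next = step(previous)
    .persist(StorageLevel.MEMORY_AND_DISK_2)
    .localCheckpoint()                  // truncate lineage (trades fault tolerance)
  next.count()                          // materialize before dropping the old RDD
  previous.unpersist(blocking = true)
  next
}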

0

Source: https://habr.com/ru/post/1624259/

