I have a looping operation that generates some RDDs, repartitions them, and then runs an aggregateByKey operation. Once the loop starts, each iteration computes a final RDD, which is cached and checkpointed and also serves as the input RDD for the next iteration.
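Roughly, the structure looks like the sketch below (the variable names, checkpoint path, partition count, and aggregation logic are placeholders, not my actual job):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IterativeJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-aggregate"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder path

    var current: RDD[(String, Long)] =
      sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

    for (_ <- 1 to 10) {
      val next = current
        .repartition(200)                  // redistribution step -> shuffle files
        .aggregateByKey(0L)(_ + _, _ + _)  // aggregation step -> more shuffle files

      next.cache()
      next.checkpoint()
      next.count()                         // materialize so cache/checkpoint take effect

      current.unpersist()                  // release the previous iteration's RDD
      current = next
    }

    sc.stop()
  }
}
```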
These RDDs are quite large and produce many intermediate shuffle blocks before the final RDD of each iteration is materialized. I have shuffle compression enabled and allow the shuffles to spill to disk.
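For reference, the shuffle settings I mean are set roughly like this (assuming an older Spark version; as far as I know, spark.shuffle.spill is ignored from Spark 1.6 onward, where spilling is always enabled):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")  // compress map outputs
  .set("spark.shuffle.spill", "true")     // allow shuffles to spill to disk
```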
I notice that on my worker machines the working directory where the shuffle files are written is never cleaned up, so I eventually run out of disk space. I was under the impression that checkpointing my RDD would remove all the intermediate shuffle blocks, but that does not seem to be happening. Does anyone have any ideas on how I can clear the shuffle blocks after each iteration of the loop, or why my shuffle blocks are not being cleaned up?