This usually means that the data was fetched from the cache, so the stage did not have to be re-executed. It is consistent with your DAG, which shows that the next stage requires a shuffle (`reduceByKey`). Whenever a shuffle is involved, Spark automatically persists the generated data:
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
zero323 Jan 03 '15 at 20:19