How to checkpoint an RDD without saving all its data?

I am running a series of jobs, and an intermediate RDD is used by all of them. So I cached the intermediate RDDs, but after a few iterations everything slowed down. I then added RDD checkpointing after caching to truncate the lineage, which is no longer needed. In the Spark UI I can confirm that the checkpoint is performed correctly. But it also takes time, because it writes every RDD to the local file system. Is there an efficient way to cut off the unneeded lineage without storing the actual RDD data?

1 answer

The whole point of a checkpoint is to store all the data. That is what allows Spark to truncate the lineage and "forget" the past. Truncating the lineage without saving the data is simply impossible: once the lineage is gone, the saved data is the only way left to get the RDD's contents back.
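To see why the two go together, here is a minimal toy model in plain Python. It is not Spark code: `LazyDataset`, `lineage_depth` and this `checkpoint` are hypothetical names made up for illustration. Each dataset either holds its data or remembers its parent and the function needed to recompute itself. Dropping the parent link only works if the data has been materialized first.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API).

    A dataset either holds concrete data or knows how to recompute
    itself from its parent, which is its "lineage".
    """

    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # materialized elements, or None
        self.parent = parent   # upstream dataset, or None
        self._fn = fn          # per-element transformation

    def map(self, fn):
        # Lazy: nothing is computed, we only extend the lineage.
        return LazyDataset(parent=self, fn=fn)

    def collect(self):
        if self._data is not None:
            return list(self._data)
        # No stored data: recompute by walking back through the lineage.
        return [self._fn(x) for x in self.parent.collect()]

    def lineage_depth(self):
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def checkpoint(self):
        # The data must be stored BEFORE the lineage is dropped,
        # otherwise there would be nothing left to compute from.
        self._data = self.collect()
        self.parent = None
        self._fn = None


ds = LazyDataset(data=[1, 2, 3])
for _ in range(5):
    ds = ds.map(lambda x: x + 1)

depth_before = ds.lineage_depth()
ds.checkpoint()
depth_after = ds.lineage_depth()
result = ds.collect()
```

In this sketch, removing the line `self._data = self.collect()` from `checkpoint` would leave a dataset with neither data nor a parent, so `collect` would fail. The cost of writing the data out is the price of being able to drop the lineage.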


Source: https://habr.com/ru/post/1665315/

