How to checkpoint an RDD without saving all its data?

Question

I am running a series of jobs, and an intermediate RDD is used in all of them. I cached the intermediate RDDs, but after some iterations the jobs slow down. I then added RDD checkpointing after caching to break the lineage, which is no longer needed. In the Spark UI I can confirm that the checkpoint is done correctly, but it also takes time because it writes every RDD to the local file system. What is an efficient way to break the unneeded lineage without storing the actual RDD data?

+4

apache-spark spark-streaming

Bhanuday birla Dec 30 '16 at 11:25

1 answer

user7729875 · Answer 1 · 2017-03-18T01:05:12+0000

The whole point of a checkpoint is to store all of the data. That is what allows Spark to truncate the lineage and "forget" the past: once the lineage is gone, the saved data is the only way to recover a lost partition. Truncating the lineage without saving the data is simply impossible.
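To make the reasoning concrete, here is a minimal toy model in plain Python. It is not Spark code and not how Spark is implemented: `ToyRDD` and all of its methods are hypothetical names invented for this sketch. It only assumes that a dataset can be rebuilt either from stored rows or by replaying its lineage, and shows what happens when both are missing.

```python
# Toy model of lineage vs. checkpointing. NOT Spark code: ToyRDD and its
# methods are hypothetical names used only for illustration.

class ToyRDD:
    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent      # lineage: the dataset this one is derived from
        self.fn = fn              # transformation applied to each parent row
        self.data = data          # materialized rows, or None if not stored

    def map(self, fn):
        # Lazy: record how to compute the child, do not compute it yet.
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # Stored rows win; otherwise replay the lineage from the parent.
        if self.data is not None:
            return list(self.data)
        if self.parent is None:
            raise RuntimeError("no stored data and no lineage: rows are unrecoverable")
        return [self.fn(x) for x in self.parent.compute()]

    def lineage_depth(self):
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def checkpoint(self):
        # Materialize every row first; only then is it safe to drop the lineage.
        self.data = self.compute()
        self.parent = None
        self.fn = None

    def drop_lineage_without_saving(self):
        # What the question asks for: cut the lineage but keep no data.
        self.parent = None
        self.fn = None


source = ToyRDD(data=[1, 2, 3])
rdd = source
for _ in range(50):               # an iterative job keeps extending the lineage
    rdd = rdd.map(lambda x: x + 1)

depth_before = rdd.lineage_depth()
rdd.checkpoint()                  # pays the cost of storing all rows
depth_after = rdd.lineage_depth()
rows = rdd.compute()              # served from the stored rows

broken = source.map(lambda x: x * 2)
broken.drop_lineage_without_saving()
try:
    broken.compute()
    recoverable = True
except RuntimeError:
    recoverable = False           # nothing left to rebuild the rows from
```

In this sketch the checkpointed dataset still answers `compute()` after its lineage is dropped, while the one whose lineage was cut without saving has no way to produce its rows. In a real job the practical levers are therefore how often you checkpoint and where the checkpoint directory lives, not whether the data gets written.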
