PySpark: do I need to re-cache a DataFrame?

Say I have a data frame:

    rdd = sc.textFile(file)
    df = sqlContext.createDataFrame(rdd)
    df.cache()

and I add a column

    df = df.withColumn('c1', lit(0))

I want to use df several times. Do I need to call cache() again on the DataFrame, or will Spark do this for me automatically?

1 answer

You will have to re-cache the DataFrame every time you manipulate or modify it. However, the entire DataFrame does not need to be recomputed.

 df = df.withColumn('c1', lit(0)) 

In the statement above, a new DataFrame is created and reassigned to the df variable. This time only the new column is computed; the rest is retrieved from the cache.
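
To make this concrete, here is a minimal runnable sketch of the pattern (using the newer SparkSession API rather than the sc / sqlContext style from the question; the data and column names are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()

    # Placeholder data; in the question this comes from sc.textFile(...)
    df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
    df.cache()    # mark the original DataFrame for caching
    df.count()    # an action materializes the cache

    # withColumn returns a *new* DataFrame; it is not cached automatically
    df = df.withColumn('c1', lit(0))
    df.cache()    # re-cache the new DataFrame before reusing it
    df.count()    # only the new column is computed; the rest is read from the parent's cache

If you skip the second cache(), each action on the new df re-applies the withColumn projection on top of the cached parent, which is cheap here but can matter for more expensive transformations.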


Source: https://habr.com/ru/post/1263828/

