PySpark: do I need to re-cache a DataFrame?

Say I have a data frame:

    rdd = sc.textFile(file)
    df = sqlContext.createDataFrame(rdd)
    df.cache()

and I add a column

    df = df.withColumn('c1', lit(0))

I want to use df several times. Do I need to call cache() again on the DataFrame, or will Spark do this for me automatically?

1 answer

You will have to re-cache the DataFrame every time you manipulate or modify it. However, the entire DataFrame does not need to be recomputed.

 df = df.withColumn('c1', lit(0)) 

In the statement above, a new DataFrame is created and reassigned to the df variable. This time only the new column is computed; the rest is retrieved from the cache.
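
To make this concrete, here is a minimal runnable sketch of the pattern (using the newer SparkSession API rather than the sc / sqlContext style from the question; the data and column names are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()

    # Placeholder data; in the question this comes from sc.textFile(...)
    df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
    df.cache()    # mark the original DataFrame for caching
    df.count()    # an action materializes the cache

    # withColumn returns a *new* DataFrame; it is not cached automatically
    df = df.withColumn('c1', lit(0))
    df.cache()    # re-cache the new DataFrame before reusing it
    df.count()    # only the new column is computed; the rest is read from the parent's cache

If you skip the second cache(), each action on the new df re-applies the withColumn projection on top of the cached parent, which is cheap here but can matter for more expensive transformations.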


Source: https://habr.com/ru/post/1263828/

