Is DataFrame.show() an action in Spark?

I have the following code:

val df_in = sqlContext.read.json(jsonFile) // the file resides in HDFS

// ... some operations here create df from df_in, with two extra columns "terms1" and "terms2"

// intersect two sequences
val intersectUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => seq1 intersect seq2 )

// compute the symmetric difference of two sequences
val symmDiffUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => (seq1 diff seq2) ++ (seq2 diff seq1) )

// add the intersection and difference columns and filter the resulting DF
val df1 = df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2")))
  .withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2")))
  .where(size(col("termsInt")) > 0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2)
  .cache()

df1.show()
df1.count()

The application runs smoothly and quickly up to show(), but the count() step spawns 40,000 tasks.

I understood that df1.show() would trigger the full computation of df1, so df1.count() should then be very fast. What am I missing here? Why is count() so slow?

Thank you very much in advance, Roxanne

1 answer

show() is indeed an action, but it is smart enough to know when it doesn't need to run everything. If you had an orderBy, it would take a very long time too, but in this case all your operations are map operations, so there is no need to compute the whole final table: show() only has to produce a handful of rows, so Spark evaluates just enough of the data to fill them. count(), however, must physically go through the entire table to read it, and that is why it takes so long. You can verify what I'm saying by adding an orderBy to the definition of df1 — then show() will take a long time as well.
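The lazy-evaluation behavior described above can be illustrated without a Spark cluster. This is a minimal sketch in plain Scala using LazyList (names like `evaluated` and `rows` are my own, purely for illustration): a show()-like take(20) only forces the elements it needs through the map step, while a count()-like size forces every element.

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = 0

    // 40,000 "rows", with a map step that counts how often it actually runs
    val rows = LazyList.from(1).take(40000).map { x => evaluated += 1; x * 2 }

    // analogous to df.show(20): only 20 elements are forced
    rows.take(20).toList
    println(s"after preview: $evaluated elements evaluated")

    // analogous to df.count(): the whole collection is forced
    val total = rows.size
    println(s"after count: $total rows, $evaluated elements evaluated")
  }
}
```

The same asymmetry holds in Spark whenever the lineage contains only narrow, map-like transformations; a wide operation such as orderBy removes the shortcut because the preview can no longer be produced from a prefix of the data.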

EDIT: Also, the 40k tasks are probably related to the number of partitions your DF is split into. Try df1.repartition(<a sensible number here, depending on cluster and DF size>) and then count again.


Source: https://habr.com/ru/post/1258153/

