I have the following code:
    import org.apache.spark.sql.functions.{col, size, udf}

    val df_in = sqlcontext.read.json(jsonFile) // the file resides in HDFS

    // some operations in here to create df as df_in with two more columns, "terms1" and "terms2"

    // intersects two sequences
    val intersectUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => { seq1 intersect seq2 } )

    // computes the symmetric difference of two sequences
    val symmDiffUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => { (seq1 diff seq2) ++ (seq2 diff seq1) } )

    // add the intersection and difference columns and filter the resulting DF
    val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2")))
      .withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2")))
      .where(size(col("termsInt")) > 0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2)
      .cache()
    )

    df1.show()
    df1.count()
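For reference, here is a minimal sketch of what the two UDF bodies compute on plain Scala sequences, outside Spark. The object name SeqOpsSketch, the function names, and the sample terms are made up for illustration; only the `intersect` and `diff` expressions come from the code above.

```scala
// Illustration only: the same expressions as in the two UDFs,
// applied to ordinary Scala sequences (no Spark required).
object SeqOpsSketch {

  // Elements of seq1 that also occur in seq2, in seq1's order.
  def intersectTerms(seq1: Seq[String], seq2: Seq[String]): Seq[String] =
    seq1 intersect seq2

  // Elements that occur in exactly one of the two sequences.
  def symmDiffTerms(seq1: Seq[String], seq2: Seq[String]): Seq[String] =
    (seq1 diff seq2) ++ (seq2 diff seq1)

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data.
    val terms1 = Seq("a", "b", "c")
    val terms2 = Seq("b", "c", "d")

    println(intersectTerms(terms1, terms2)) // List(b, c)
    println(symmDiffTerms(terms1, terms2))  // List(a, d)
  }
}
```

With these sample values, the row would pass the filter: the intersection is non-empty and the symmetric difference has between 1 and 2 elements.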
The application runs smoothly and quickly up to show(), but the count() step creates 40,000 tasks.
My understanding is that df1.show() should trigger the full computation of df1, so df1.count() should then be very fast. What am I missing here? Why is count() so slow?
Thank you very much in advance, Roxanne