This is with reference to Jacek's answer to the question of how to get the size of the result generated with concat_ws.
The DSL query in that answer calls collect_list twice, once for concat_ws and once for size:
input.groupBy($"COL1").agg(
concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
size(collect_list($"COL2".cast("string"))) as "size"
)
Its optimizedPlan looks like this:
Aggregate [COL1#9L],
[COL1#9L,
concat_ws(,,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList@470a4e26),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS concat#13,
size((hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList@2602f45),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS size#14]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
+- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
How will this differ in performance if I call collect_list only once and then use the withColumn API to derive the two other columns?
input
.groupBy("COL1")
.agg(collect_list($"COL2".cast("string")).as("list") )
.withColumn("concat", concat_ws("," , $"list"))
.withColumn("size", size($"list"))
.drop("list")
This version has the following optimizedPlan:
Project [COL1#9L,
concat_ws(,,list#17) AS concat#18,
size(list#17) AS size#19]
+- Aggregate [COL1#9L],
[COL1#9L,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList@5cb88b6b),
cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false) AS list#17]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
+- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
I see that collect_list is called twice in the first example, but I just wanted to find out whether there are any significant differences. I am using Spark 1.6.
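For reference, here is a minimal sketch of how the two plans above can be reproduced and compared side by side. It assumes a Spark 1.6 shell with a SQLContext named sqlContext, and it rebuilds the input from the Project node shown in the plans (id % 2 as COL1, id as COL2):

```scala
// Sketch for Spark 1.6: rebuild the input and print both optimized plans.
// Assumes a SQLContext named `sqlContext` (as in spark-shell).
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Matches the Project [(id % 2) AS COL1, id AS COL2] seen in the plans.
val input = sqlContext.range(10)
  .select(($"id" % 2) as "COL1", $"id" as "COL2")

// Variant 1: two separate collect_list aggregations.
val twoAggs = input.groupBy($"COL1").agg(
  concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
  size(collect_list($"COL2".cast("string"))) as "size")

// Variant 2: a single collect_list, post-processed with withColumn.
val oneAgg = input
  .groupBy("COL1")
  .agg(collect_list($"COL2".cast("string")).as("list"))
  .withColumn("concat", concat_ws(",", $"list"))
  .withColumn("size", size($"list"))
  .drop("list")

// Print the optimized logical plans for comparison.
println(twoAggs.queryExecution.optimizedPlan)
println(oneAgg.queryExecution.optimizedPlan)
```

Running explain(true) on each DataFrame would additionally show the physical plans, which is where any duplicated aggregation work would ultimately appear.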