This is with reference to Jacek's answer to the question of how to get the size of the result generated with concat_ws.
The DSL query in that answer calls collect_list twice, once for concat_ws and once for size:
input.groupBy($"COL1").agg(
concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
size(collect_list($"COL2".cast("string"))) as "size"
)
Its optimizedPlan looks like this:
Aggregate [COL1#9L],
[COL1#9L,
concat_ws(,,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList@470a4e26),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS concat#13,
size((hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList@2602f45),cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false)) AS size#14]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
+- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
How will this differ in performance if I call collect_list only once and then use the withColumn API to derive the two other columns?
input
.groupBy("COL1")
.agg(collect_list($"COL2".cast("string")).as("list") )
.withColumn("concat", concat_ws("," , $"list"))
.withColumn("size", size($"list"))
.drop("list")
This version has the following optimizedPlan:
Project [COL1#9L,
concat_ws(,,list#17) AS concat#18,
size(list#17) AS size#19]
+- Aggregate [COL1#9L],
[COL1#9L,(hiveudaffunction(HiveFunctionWrapper(GenericUDAFCollectList,GenericUDAFCollectList@5cb88b6b),
cast(COL2#10L as string),false,0,0),mode=Complete,isDistinct=false) AS list#17]
+- Project [(id#8L % 2) AS COL1#9L,id#8L AS COL2#10L]
+- LogicalRDD [id#8L], MapPartitionsRDD[20] at range at <console>:25
I see that collect_list is called twice in the first example, but I just wanted to find out whether there are any significant differences. I am using Spark 1.6.
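For reference, here is a minimal sketch of how the two plans above can be reproduced and compared side by side. It assumes a Spark 1.6 shell with a SQLContext named sqlContext, and it rebuilds the input from the Project node shown in the plans (id % 2 as COL1, id as COL2):

```scala
// Sketch for Spark 1.6: rebuild the input and print both optimized plans.
// Assumes a SQLContext named `sqlContext` (as in spark-shell).
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Matches the Project [(id % 2) AS COL1, id AS COL2] seen in the plans.
val input = sqlContext.range(10)
  .select(($"id" % 2) as "COL1", $"id" as "COL2")

// Variant 1: two separate collect_list aggregations.
val twoAggs = input.groupBy($"COL1").agg(
  concat_ws(",", collect_list($"COL2".cast("string"))) as "concat",
  size(collect_list($"COL2".cast("string"))) as "size")

// Variant 2: a single collect_list, post-processed with withColumn.
val oneAgg = input
  .groupBy("COL1")
  .agg(collect_list($"COL2".cast("string")).as("list"))
  .withColumn("concat", concat_ws(",", $"list"))
  .withColumn("size", size($"list"))
  .drop("list")

// Print the optimized logical plans for comparison.
println(twoAggs.queryExecution.optimizedPlan)
println(oneAgg.queryExecution.optimizedPlan)
```

Running explain(true) on each DataFrame would additionally show the physical plans, which is where any duplicated aggregation work would ultimately appear.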