How to get the size of the result generated with concat_ws?

I run groupBy on COL1 and combine the COL2 values into a single string with concat_ws. How can I get the number of values in this list? Here is my code:

Dataset<Row> ds = df.groupBy("COL1").agg(org.apache.spark.sql.functions
    .concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"));
1 answer

Use size.

size(e: Column): Column
Returns the length of an array or map.
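
As a quick standalone illustration of what size does (my own minimal sketch, assuming an existing SparkSession named spark and a static import of org.apache.spark.sql.functions), it simply counts the elements of an array column:

import static org.apache.spark.sql.functions.*;

// size on a three-element literal array column returns 3
spark.range(1)
    .select(size(array(lit("a"), lit("b"), lit("c"))).as("n"))
    .show();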


The following example is presented in Scala, and I leave it to you to convert it to Java, but the general idea is exactly the same, regardless of the programming language.

val input = spark.range(4)
  .withColumn("COL1", $"id" % 2)
  .select($"COL1", $"id" as "COL2")
scala> input.show
+----+----+
|COL1|COL2|
+----+----+
|   0|   0|
|   1|   1|
|   0|   2|
|   1|   3|
+----+----+

val s = input
  .groupBy("COL1")
  .agg(
    concat_ws(",", collect_list("COL2")) as "concat",
    size(collect_list("COL2")) as "size") // <-- size
scala> s.show
+----+------+----+
|COL1|concat|size|
+----+------+----+
|   0|   0,2|   2|
|   1|   1,3|   2|
+----+------+----+

In Java, it would look as follows. Thanks to Krishna Prasad for sharing the code with the SO/Spark community!

Dataset<Row> ds = df.groupBy("COL1").agg(
  org.apache.spark.sql.functions.concat_ws(",", org.apache.spark.sql.functions.collect_list("COL2")).as("sample"),
  org.apache.spark.sql.functions.size(org.apache.spark.sql.functions.collect_list("COL2")).as("size"));
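
If you would rather not call collect_list twice, one possible variant (my own sketch, not part of the original answer; the column name "items" and the variable ds2 are made up) is to aggregate the list once and derive both the concatenated string and the count from that array column:

Dataset<Row> ds2 = df.groupBy("COL1")
    // collect the COL2 values once into an array column
    .agg(org.apache.spark.sql.functions.collect_list("COL2").as("items"))
    // build the comma-separated string from the collected array
    .withColumn("sample",
      org.apache.spark.sql.functions.concat_ws(",", org.apache.spark.sql.functions.col("items")))
    // count the elements of the same array
    .withColumn("size",
      org.apache.spark.sql.functions.size(org.apache.spark.sql.functions.col("items")))
    .drop("items");

This is mostly a readability choice; Spark may well evaluate the duplicated collect_list in the previous snippet only once anyway.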

Source: https://habr.com/ru/post/1689885/
