Spark: group a DataFrame several times

```scala
val df = Seq(
  (1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
  (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11")
).toDF("col1", "col2", "col3")
```

So I have a Spark DataFrame with three columns.

My requirement is that I need to perform two levels of groupBy, as described below.

Level 1: if I group by col1 and sum col3, I get two columns: 1. col1, 2. sum(col3). I lose col2 here.

Level 2: if I instead group by col1 and col2 and sum col3, I get three columns: 1. col1, 2. col2, 3. sum(col3).

What I actually need is to perform both levels of groupBy and have both sums (sum(col3) from level 1 and sum(col3) from level 2) in a single final DataFrame.

How can I do this? Can someone explain?

Spark: 1.6.2, Scala: 2.10

1 answer

One option is to compute the two sums separately and then join them back together:

```scala
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1"))
  .show)

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   3|   r|      11.0|      11.0|
|   1|   a|      10.0|      47.0|
+----+----+----------+----------+
```

Another option is to use window functions, given that sum_level1 is simply the sum of sum_level2 grouped by col1:

```scala
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy($"col1")

(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .withColumn("sum_level1", sum($"sum_level2").over(w))
  .show)

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   1|   a|      10.0|      47.0|
|   3|   r|      11.0|      11.0|
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
+----+----+----------+----------+
```
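As a sanity check, the same numbers can be reproduced with plain Scala collections, no Spark required. This is just an illustrative sketch of the two-level aggregation semantics: the rows are copied from the question, with col3 as Int standing in for the string column that Spark implicitly casts to double.

```scala
// The question's rows, with col3 as Int for plain-collection arithmetic
val rows = Seq(
  (1, "a", 10), (1, "b", 12), (1, "c", 13), (2, "a", 14),
  (2, "c", 11), (1, "b", 12), (2, "c", 12), (3, "r", 11)
)

// Level 2: sum of col3 per (col1, col2)
val sumLevel2 = rows.groupBy(r => (r._1, r._2))
                    .map { case (k, v) => k -> v.map(_._3).sum }

// Level 1: sum of col3 per col1 (also the sum of the level-2 sums within each col1)
val sumLevel1 = rows.groupBy(_._1)
                    .map { case (k, v) => k -> v.map(_._3).sum }

// Final shape: each (col1, col2) row carries both sums, like the joined DataFrame
val combined = sumLevel2.toSeq.map { case ((c1, c2), s2) => (c1, c2, s2, sumLevel1(c1)) }
```

The sums match the tables above: col1 = 1 totals 47, col1 = 2 totals 37, col1 = 3 totals 11, and e.g. (2, "c") sums to 23.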

Source: https://habr.com/ru/post/1014266/

