How to calculate the count and the sum within a single group?

Based on the following DataFrame:

 val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")

 +---+-----+----+
 | ID|Categ|Amnt|
 +---+-----+----+
 |  1|    A|  10|
 |  2|    A|   5|
 |  3|    B|  56|
 +---+-----+----+

I would like to get the number of IDs and the total amount per category:

 +-----+-----+---------+
 |Categ|count|sum(Amnt)|
 +-----+-----+---------+
 |    B|    1|       56|
 |    A|    2|       15|
 +-----+-----+---------+

Can I compute both the count and the sum without resorting to a join, as I do here?

 client.groupBy("Categ").count
   .join(client.withColumnRenamed("Categ","cat")
     .groupBy("cat")
     .sum("Amnt"), 'Categ === 'cat)
   .drop("cat")

Maybe something like this:

 client.createOrReplaceTempView("client")
 spark.sql("SELECT Categ, count(Categ), sum(Amnt) FROM client GROUP BY Categ").show()
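For reference, both aggregates can come out of a single pass, with no join and no temporary view. A minimal sketch using only the client DataFrame defined above (the answers below expand on this):

 import org.apache.spark.sql.functions.{count, sum}

 // One groupBy with two aggregate expressions replaces the self-join.
 client.groupBy("Categ")
   .agg(count("ID").as("count"), sum("Amnt").as("sum(Amnt)"))
   .show()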
3 answers

Let me give an example slightly different from yours.

Several aggregate functions can be combined in a single call; adapt accordingly:

 // In 1.3.x, in order for the grouping column "department" to show up,
 // it must be included explicitly as part of the agg function call.
 df.groupBy("department").agg($"department", max("age"), sum("expense"))

 // In 1.4+, the grouping column "department" is included automatically.
 df.groupBy("department").agg(max("age"), sum("expense"))

 import org.apache.spark.sql.{DataFrame, SparkSession}
 import org.apache.spark.sql.functions._
 import spark.implicits._

 val client: DataFrame = spark.sparkContext.parallelize(
   Seq((1,"A",10),(2,"A",5),(3,"B",56))
 ).toDF("ID","Categ","Amnt")

 client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()

 +-----+---------+---------+
 |Categ|sum(Amnt)|count(ID)|
 +-----+---------+---------+
 |    B|       56|        1|
 |    A|       15|        2|
 +-----+---------+---------+
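One caveat: groupBy makes no promise about output row order (here B happens to come before A). If you need the categories sorted, add an explicit orderBy; a minimal sketch on the same DataFrame:

 import org.apache.spark.sql.functions.{count, sum}

 client.groupBy("Categ")
   .agg(sum("Amnt"), count("ID"))
   .orderBy("Categ")   // deterministic output order: A before B
   .show()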

You can perform the aggregation as shown below:

 client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()

 +-----+---------+---------+
 |Categ|sum(Amnt)|count(ID)|
 +-----+---------+---------+
 |    A|       15|        2|
 |    B|       56|        1|
 +-----+---------+---------+
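A variant not in the original answer, in case the same ID could appear more than once within a category: countDistinct counts unique IDs instead of rows.

 import org.apache.spark.sql.functions.{countDistinct, sum}

 // Counts each distinct ID once per category; on this sample data the
 // result matches count("ID"), since all IDs are unique.
 client.groupBy("Categ")
   .agg(sum("Amnt"), countDistinct("ID").as("distinct_IDs"))
   .show()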

There are several ways to apply aggregate functions in Spark:

 val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt") 

1.

 val aggdf = client.groupBy('Categ).agg(Map("ID"->"count","Amnt"->"sum"))

 +-----+---------+---------+
 |Categ|count(ID)|sum(Amnt)|
 +-----+---------+---------+
 |B    |1        |56       |
 |A    |2        |15       |
 +-----+---------+---------+

 // Rename and sort as needed.
 aggdf.sort('Categ)
   .withColumnRenamed("count(ID)","Count")
   .withColumnRenamed("sum(Amnt)","sum")

 +-----+-----+---+
 |Categ|Count|sum|
 +-----+-----+---+
 |A    |2    |15 |
 |B    |1    |56 |
 +-----+-----+---+

2.

 import org.apache.spark.sql.functions._

 client.groupBy('Categ).agg(count("ID").as("count"),sum("Amnt").as("sum"))

 +-----+-----+---+
 |Categ|count|sum|
 +-----+-----+---+
 |B    |1    |56 |
 |A    |2    |15 |
 +-----+-----+---+

3.

 import com.google.common.collect.ImmutableMap

 client.groupBy('Categ).agg(ImmutableMap.of("ID", "count", "Amnt", "sum"))

 +-----+---------+---------+
 |Categ|count(ID)|sum(Amnt)|
 +-----+---------+---------+
 |B    |1        |56       |
 |A    |2        |15       |
 +-----+---------+---------+

 // Rename columns as needed.

4. If you are comfortable with SQL, you can do it that way too:

 client.createOrReplaceTempView("df")
 val aggdf = spark.sql("select Categ, count(ID), sum(Amnt) from df group by Categ")
 aggdf.show()

 +-----+---------+---------+
 |Categ|count(ID)|sum(Amnt)|
 +-----+---------+---------+
 |    B|        1|       56|
 |    A|        2|       15|
 +-----+---------+---------+
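If you want friendlier headers from the SQL route as well, aliases work directly in the query (same data and grouping as above; the alias names are just illustrative):

 val renamed = spark.sql(
   "select Categ, count(ID) as count, sum(Amnt) as sum from df group by Categ")
 renamed.show()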

Source: https://habr.com/ru/post/1012075/

