There are several ways to apply aggregate functions in Spark. All of the examples below use this sample DataFrame:
val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")
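The toDF call above relies on spark.implicits._, which the spark-shell provides automatically. If you run the snippets as a standalone app, a minimal setup sketch (the app name and master are placeholders of mine):

import org.apache.spark.sql.SparkSession

// Minimal local session for trying the examples; appName and master are arbitrary
val spark = SparkSession.builder().appName("agg-examples").master("local[*]").getOrCreate()
import spark.implicits._  // enables .toDF on local Seqs and the 'col symbol syntax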
1. Pass a Map of column name -> aggregate function name to agg:
val aggdf = client.groupBy('Categ).agg(Map("ID" -> "count", "Amnt" -> "sum"))

+-----+---------+---------+
|Categ|count(ID)|sum(Amnt)|
+-----+---------+---------+
|B    |1        |56       |
|A    |2        |15       |
+-----+---------+---------+

// Rename and sort as needed.
aggdf.sort('Categ).withColumnRenamed("count(ID)", "Count").withColumnRenamed("sum(Amnt)", "sum")

+-----+-----+---+
|Categ|Count|sum|
+-----+-----+---+
|A    |2    |15 |
|B    |1    |56 |
+-----+-----+---+
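The string function names in the Map are not limited to count and sum; a small sketch using the same client DataFrame (the extra aggregates are my own illustration):

// Other built-in aggregate names such as "avg", "min" and "max" work the same way
val moreAggs = client.groupBy('Categ).agg(Map("Amnt" -> "avg", "ID" -> "max"))
moreAggs.show(false)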
2. Use the aggregate functions from org.apache.spark.sql.functions and alias the result columns directly:
import org.apache.spark.sql.functions._

client.groupBy('Categ).agg(count("ID").as("count"), sum("Amnt").as("sum"))

+-----+-----+---+
|Categ|count|sum|
+-----+-----+---+
|B    |1    |56 |
|A    |2    |15 |
+-----+-----+---+
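This form also makes it easy to compute several aggregates over the same column; a minimal sketch (the extra avg and max columns are my own additions):

import org.apache.spark.sql.functions._

client.groupBy('Categ)
  .agg(
    count("ID").as("count"),
    sum("Amnt").as("sum"),
    avg("Amnt").as("avg_amnt"),   // illustrative extra aggregate
    max("Amnt").as("max_amnt")    // illustrative extra aggregate
  )
  .show(false)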
3. Pass a java.util.Map (here built with Guava's ImmutableMap) to agg:
import com.google.common.collect.ImmutableMap

client.groupBy('Categ).agg(ImmutableMap.of("ID", "count", "Amnt", "sum"))

+-----+---------+---------+
|Categ|count(ID)|sum(Amnt)|
+-----+---------+---------+
|B    |1        |56       |
|A    |2        |15       |
+-----+---------+---------+

// Rename the columns afterwards if required.
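If you would rather not pull in Guava from Scala code, the same java.util.Map overload of agg can be fed a converted Scala Map; a sketch assuming the standard JavaConverters:

import scala.collection.JavaConverters._

// Same result as the ImmutableMap version, without the Guava dependency
client.groupBy('Categ).agg(Map("ID" -> "count", "Amnt" -> "sum").asJava)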
4. If you are comfortable with SQL, you can register a temp view and do it with Spark SQL:
client.createOrReplaceTempView("df")
val aggdf = spark.sql("select Categ, count(ID), sum(Amnt) from df group by Categ")
aggdf.show()

+-----+---------+---------+
|Categ|count(ID)|sum(Amnt)|
+-----+---------+---------+
|    B|        1|       56|
|    A|        2|       15|
+-----+---------+---------+
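If you want friendlier column names straight from the query, you can alias the aggregates in the SQL itself; a minimal sketch against the same temp view:

// Aliases and ordering chosen for illustration
val named = spark.sql("select Categ, count(ID) as Count, sum(Amnt) as Sum from df group by Categ order by Categ")
named.show()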