How to perform custom operations with GroupedData in Spark?

I want to rewrite part of my code from RDDs to DataFrames. It went pretty smoothly until I hit this:

 events
  .keyBy(row => (row.getServiceId + row.getClientCreateTimestamp + row.getClientId, row) )
  .reduceByKey((e1, e2) => if(e1.getClientSendTimestamp <= e2.getClientSendTimestamp) e1 else e2)
  .values

It's easy enough to start with:

 events
  .groupBy(events("service_id"), events("client_create_timestamp"), events("client_id"))

But what's next? What if I want to iterate over all the elements in the current group? Is that even possible? Thanks in advance.

1 answer

GroupedData cannot be used directly. The data is not physically grouped; grouping is only a logical operation. You have to apply some variant of the agg method, for example:

events
 .groupBy($"service_id", $"client_create_timestamp", $"client_id")
 .min("client_send_timestamp")

or

events
 .groupBy($"service_id", $"client_create_timestamp", $"client_id")
 .agg(min($"client_send_timestamp"))

where client_send_timestamp is the column you want to aggregate.

If you want to keep the remaining columns as well, not just the aggregated one, you will most likely need a join back to the original Spark DataFrame.
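A sketch of that join-back approach, assuming `events` has the column names used in the question:

```scala
import org.apache.spark.sql.functions.min
import spark.implicits._ // for the $"..." column syntax

// First compute the minimum send timestamp per group...
val minTs = events
  .groupBy($"service_id", $"client_create_timestamp", $"client_id")
  .agg(min($"client_send_timestamp").as("client_send_timestamp"))

// ...then join back on the grouping keys plus the aggregated column
// to recover the complete rows, mirroring the original reduceByKey.
val firstEvents = events.join(
  minTs,
  Seq("service_id", "client_create_timestamp", "client_id", "client_send_timestamp")
)
```

Note one semantic difference: if two rows in a group tie on client_send_timestamp, the join keeps both, whereas reduceByKey would keep exactly one.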

If you need arbitrary logic that is hard to express with the built-in functions, you can define a custom aggregation function. See: How to define a custom aggregation function in Spark SQL?

Spark 2.0+

You can use Dataset.groupByKey, which exposes grouping by an arbitrary key function.
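A minimal sketch of the Dataset.groupByKey variant, assuming a hypothetical Event case class matching the accessors used in the question:

```scala
import spark.implicits._ // encoders for case classes and tuples

// Hypothetical schema inferred from the question's accessors.
case class Event(
  serviceId: String,
  clientCreateTimestamp: Long,
  clientId: String,
  clientSendTimestamp: Long)

val result = events.as[Event]
  .groupByKey(e => (e.serviceId, e.clientCreateTimestamp, e.clientId))
  // Same reduction as the original reduceByKey on the RDD.
  .reduceGroups((e1, e2) =>
    if (e1.clientSendTimestamp <= e2.clientSendTimestamp) e1 else e2)
  .map(_._2) // drop the key, keep only the reduced event
```

This is closest to the original RDD code, but note that groupByKey with a functional reduction bypasses some Catalyst optimizations that the agg-based versions can use.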


Source: https://habr.com/ru/post/1627688/

