Say I have a data structure like this, where ts is some timestamp:
case class Record(ts: Long, id: Int, value: Int)
Given a large number of these records, I want to find the record with the highest timestamp for each id. Using the RDD API, I think the following code does the job:
def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
  records.keyBy(_.id).reduceByKey {
    (x, y) => if (x.ts > y.ts) x else y
  }.values
}
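For what it's worth, here is a minimal sketch of how I am calling the RDD version (the sample data and the local-mode session are just for illustration):

import org.apache.spark.sql.SparkSession

implicit val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()

val sample = spark.sparkContext.parallelize(Seq(
  Record(1L, 1, 10),
  Record(5L, 1, 20),  // latest record for id 1
  Record(3L, 2, 30)   // only record for id 2
))
findLatest(sample).collect()  // Array(Record(5,1,20), Record(3,2,30)), order not guaranteed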
Similarly, this is my attempt with Datasets:
def findLatest(records: Dataset[Record])(implicit spark: SparkSession) = {
  records.groupByKey(_.id).mapGroups {
    case (id, records) =>
      records.reduceLeft((x, y) => if (x.ts > y.ts) x else y)
  }
}
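Calling the Dataset version looks much the same; note that groupByKey and mapGroups need encoders, which import spark.implicits._ provides for the Int key and the Record case class (same illustrative sample as above):

import spark.implicits._  // encoders for the Int key and the Record case class

val ds = Seq(Record(1L, 1, 10), Record(5L, 1, 20), Record(3L, 2, 30)).toDS()
findLatest(ds).show()  // one row per id, each carrying its highest ts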
I am trying to figure out how to achieve something similar using DataFrames, but to no avail. I understand that I can group with:
records.groupBy($"id")
But this gives me a RelationalGroupedDataset, and I don't understand what aggregation function I need to write to achieve what I want: all the example aggregations I have seen focus on returning a single aggregated column rather than the whole row.
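To make that concrete: the closest I can get is something like the following, which keeps only the id and the maximum timestamp and loses the matching value entirely:

import org.apache.spark.sql.functions.max

records.groupBy($"id").agg(max($"ts"))  // yields (id, max(ts)); the corresponding value column is gone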
Is it possible to achieve this with DataFrames?