Suppose I have a DataFrame, for example:
val json = sc.parallelize(Seq("""{"a":1, "b":2, "c":22, "d":34}""","""{"a":3, "b":9, "c":22, "d":12}""","""{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)
I want to deduplicate rows on column "a", keeping the row with the larger value in column "b". That is, if several rows share the same value of "a", only the one with the largest "b" should be kept. For the example above, after processing I only need
{"a": 3, "b": 9, "c": 22, "d": 12}
and
{"a": 1, "b": 4, "c": 23, "d": 12}
The Spark DataFrame dropDuplicates API does not seem to support this. With the RDD API I can do map().reduceByKey(), but which DataFrame operation does this?
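For reference, this is a plain-Scala sketch of the per-key "keep the larger b" reduce I have in mind (the `Rec` case class and collection-based `groupBy` are just stand-ins for the RDD version, shown in the comment):

```scala
// Hypothetical record type mirroring the JSON rows above.
case class Rec(a: Int, b: Int, c: Int, d: Int)

val rows = Seq(Rec(1, 2, 22, 34), Rec(3, 9, 22, 12), Rec(1, 4, 23, 12))

// Group by "a" and keep the record with the largest "b" -- the same
// per-key logic I would apply with the RDD API:
//   rdd.map(r => (r.a, r)).reduceByKey((x, y) => if (x.b >= y.b) x else y).values
val deduped = rows
  .groupBy(_.a)   // key by column "a"
  .values
  .map(_.maxBy(_.b)) // within each key, keep the row with the largest "b"
  .toSeq
```

This produces the two rows I listed above; I'm looking for the idiomatic DataFrame equivalent.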
Appreciate some help, thanks.