Spark: remove duplicate rows from a DataFrame

Suppose I have a DataFrame, for example:

val json = sc.parallelize(Seq(
  """{"a":1, "b":2, "c":22, "d":34}""",
  """{"a":3, "b":9, "c":22, "d":12}""",
  """{"a":1, "b":4, "c":23, "d":12}"""))
val df = sqlContext.read.json(json)

I want to remove duplicate rows on column "a", choosing which row to keep by the value of column "b". That is, if several rows share the same value of "a", I want to keep the one with the larger value of "b". For the example above, after processing I only need

{"a": 3, "b": 9, "c": 22, "d": 12}

and

{"a": 1, "b": 4, "c": 23, "d": 12}

The Spark DataFrame dropDuplicates API does not seem to support this. With the RDD approach I could do map().reduceByKey(), but what DataFrame operation does this?
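To be concrete, the map().reduceByKey() logic I have in mind looks like this, sketched with plain Scala collections instead of an RDD (the Row case class is just a stand-in for my JSON records):

```scala
// Hypothetical row type standing in for the parsed JSON records
case class Row(a: Int, b: Int, c: Int, d: Int)

val rows = Seq(Row(1, 2, 22, 34), Row(3, 9, 22, 12), Row(1, 4, 23, 12))

// rdd.map(r => (r.a, r)).reduceByKey(keep the row with larger b),
// expressed with plain collections: group by "a", keep max "b" per group
val deduped = rows
  .map(r => (r.a, r))
  .groupBy(_._1)
  .map { case (_, grouped) => grouped.map(_._2).maxBy(_.b) }
  .toList
```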

Appreciate some help, thanks.


You can do this with a Spark SQL window function:

df.registerTempTable("x")
sqlContext.sql("""
  SELECT a, b, c, d
  FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y
  WHERE rn = 1
""").collect

Window functions have been supported since Spark 1.4; see https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html for details.
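The same query can be written with the DataFrame window API instead of raw SQL. A sketch (note that in Spark 1.4/1.5 the function was named rowNumber; it was renamed row_number in later releases):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Partition by "a", order each partition by "b" descending,
// number the rows, and keep only the first row of each partition
val w = Window.partitionBy("a").orderBy(df("b").desc)
val deduped = df
  .withColumn("rn", row_number().over(w))
  .where("rn = 1")
  .drop("rn")
```

This keeps exactly one row per distinct "a", the one with the largest "b", matching the SQL above.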


Source: https://habr.com/ru/post/1629503/
