Spark: how to drop duplicates in a DataFrame while keeping the row with the latest timestamp

I have a use case where I need to delete duplicate rows of a DataFrame (in this case, "duplicate" means rows with the same "id" field), keeping the row with the highest timestamp (a unix timestamp).

I found the dropDuplicates method (I use PySpark), but I have no control over which row will be kept.

Can anybody help? Thanks in advance.

2 answers

A manual map and reduce may be required to get the functionality you need.

    def selectRowByTimeStamp(x, y):
        # Keep whichever Row carries the larger unix timestamp.
        if x.timestamp > y.timestamp:
            return x
        return y

    # Key each Row by its id; .rdd is needed because modern PySpark
    # DataFrames no longer expose .map directly.
    dataMap = data.rdd.map(lambda x: (x.id, x))
    uniqueData = dataMap.reduceByKey(selectRowByTimeStamp)

Here we key every record by its id. Then, as reduceByKey combines the records for each key, it keeps whichever one has the higher timestamp. Once the reduce completes, exactly one record remains per id.
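
For completeness, here is a runnable end-to-end PySpark sketch of the same idea; the sample data and the final conversion back to a DataFrame are assumptions not shown in the answer above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data shaped like the question: an "id" field
    # plus a unix "timestamp" field.
    data = spark.createDataFrame(
        [(1, 12345678, "this is a test"), (1, 23456789, "another test")],
        ["id", "timestamp", "data"],
    )

    def selectRowByTimeStamp(x, y):
        # Keep whichever Row carries the larger unix timestamp.
        return x if x.timestamp > y.timestamp else y

    uniqueData = data.rdd.map(lambda x: (x.id, x)).reduceByKey(selectRowByTimeStamp)

    # reduceByKey yields (id, Row) pairs; drop the keys and rebuild a DataFrame.
    result = spark.createDataFrame(uniqueData.values())
    result.show()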


You can do something like this:

    val df = Seq(
      (1, 12345678, "this is a test"),
      (1, 23456789, "another test"),
      (2, 2345678,  "2nd test"),
      (2, 1234567,  "2nd another test")
    ).toDF("id", "timestamp", "data")

    +---+---------+----------------+
    | id|timestamp|            data|
    +---+---------+----------------+
    |  1| 12345678|  this is a test|
    |  1| 23456789|    another test|
    |  2|  2345678|        2nd test|
    |  2|  1234567|2nd another test|
    +---+---------+----------------+

    df.join(
      df.groupBy($"id").agg(max($"timestamp") as "r_timestamp").withColumnRenamed("id", "r_id"),
      $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
    ).drop("r_id").drop("r_timestamp").show

    +---+---------+------------+
    | id|timestamp|        data|
    +---+---------+------------+
    |  1| 23456789|another test|
    |  2|  2345678|    2nd test|
    +---+---------+------------+

If you expect that the same id can have a repeated timestamp (see the comments below), you can do this instead:

    df.dropDuplicates(Seq("id", "timestamp")).join(
      df.groupBy($"id").agg(max($"timestamp") as "r_timestamp").withColumnRenamed("id", "r_id"),
      $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
    ).drop("r_id").drop("r_timestamp").show
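
Since the question is about PySpark, here is a hedged translation of the same join approach (reusing the assumed sample data from the Scala snippet above, with the dropDuplicates guard against ties included):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [
            (1, 12345678, "this is a test"),
            (1, 23456789, "another test"),
            (2, 2345678, "2nd test"),
            (2, 1234567, "2nd another test"),
        ],
        ["id", "timestamp", "data"],
    )

    # Per-id maximum timestamp, with "id" renamed so the join is unambiguous.
    latest = (
        df.groupBy("id")
        .agg(F.max("timestamp").alias("r_timestamp"))
        .withColumnRenamed("id", "r_id")
    )

    result = (
        df.dropDuplicates(["id", "timestamp"])  # guard against ties on (id, timestamp)
        .join(latest, (F.col("id") == F.col("r_id"))
                      & (F.col("timestamp") == F.col("r_timestamp")))
        .drop("r_id", "r_timestamp")
    )
    result.show()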

Source: https://habr.com/ru/post/1247126/

