Spark: how to drop duplicates in a DataFrame while keeping the row with the latest timestamp

I have a use case where I need to delete duplicate rows of a DataFrame (in this case, "duplicate" means rows with the same "id" field), keeping the row with the highest timestamp (a unix timestamp).

I found the dropDuplicates method (I use PySpark), but I have no control over which row will be kept.

Can anybody help? Thanks in advance.

2 answers

A manual map and reduce may be required to get the functionality you need.

    def selectRowByTimeStamp(x, y):
        # Keep whichever Row carries the larger unix timestamp.
        if x.timestamp > y.timestamp:
            return x
        return y

    # Key each Row by its id; .rdd is needed because modern PySpark
    # DataFrames no longer expose .map directly.
    dataMap = data.rdd.map(lambda x: (x.id, x))
    uniqueData = dataMap.reduceByKey(selectRowByTimeStamp)

Here we key every record by its id. Then, as reduceByKey combines the records for each key, it keeps whichever one has the higher timestamp. Once the reduce completes, exactly one record remains per id.
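
For completeness, here is a runnable end-to-end PySpark sketch of the same idea; the sample data and the final conversion back to a DataFrame are assumptions not shown in the answer above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data shaped like the question: an "id" field
    # plus a unix "timestamp" field.
    data = spark.createDataFrame(
        [(1, 12345678, "this is a test"), (1, 23456789, "another test")],
        ["id", "timestamp", "data"],
    )

    def selectRowByTimeStamp(x, y):
        # Keep whichever Row carries the larger unix timestamp.
        return x if x.timestamp > y.timestamp else y

    uniqueData = data.rdd.map(lambda x: (x.id, x)).reduceByKey(selectRowByTimeStamp)

    # reduceByKey yields (id, Row) pairs; drop the keys and rebuild a DataFrame.
    result = spark.createDataFrame(uniqueData.values())
    result.show()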


You can do something like this:

    val df = Seq(
      (1, 12345678, "this is a test"),
      (1, 23456789, "another test"),
      (2, 2345678,  "2nd test"),
      (2, 1234567,  "2nd another test")
    ).toDF("id", "timestamp", "data")

    +---+---------+----------------+
    | id|timestamp|            data|
    +---+---------+----------------+
    |  1| 12345678|  this is a test|
    |  1| 23456789|    another test|
    |  2|  2345678|        2nd test|
    |  2|  1234567|2nd another test|
    +---+---------+----------------+

    df.join(
      df.groupBy($"id").agg(max($"timestamp") as "r_timestamp").withColumnRenamed("id", "r_id"),
      $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
    ).drop("r_id").drop("r_timestamp").show

    +---+---------+------------+
    | id|timestamp|        data|
    +---+---------+------------+
    |  1| 23456789|another test|
    |  2|  2345678|    2nd test|
    +---+---------+------------+

If you expect that the same id can have a repeated timestamp (see the comments below), you can do this instead:

    df.dropDuplicates(Seq("id", "timestamp")).join(
      df.groupBy($"id").agg(max($"timestamp") as "r_timestamp").withColumnRenamed("id", "r_id"),
      $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
    ).drop("r_id").drop("r_timestamp").show
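
Since the question is about PySpark, here is a hedged translation of the same join approach (reusing the assumed sample data from the Scala snippet above, with the dropDuplicates guard against ties included):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [
            (1, 12345678, "this is a test"),
            (1, 23456789, "another test"),
            (2, 2345678, "2nd test"),
            (2, 1234567, "2nd another test"),
        ],
        ["id", "timestamp", "data"],
    )

    # Per-id maximum timestamp, with "id" renamed so the join is unambiguous.
    latest = (
        df.groupBy("id")
        .agg(F.max("timestamp").alias("r_timestamp"))
        .withColumnRenamed("id", "r_id")
    )

    result = (
        df.dropDuplicates(["id", "timestamp"])  # guard against ties on (id, timestamp)
        .join(latest, (F.col("id") == F.col("r_id"))
                      & (F.col("timestamp") == F.col("r_timestamp")))
        .drop("r_id", "r_timestamp")
    )
    result.show()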

Source: https://habr.com/ru/post/1247126/

