You can join the DataFrame against its own per-id maximum timestamp, which keeps only the latest row for each id:

    // spark-shell brings these into scope already; a standalone app needs them
    import org.apache.spark.sql.functions.max
    import spark.implicits._

    val df = Seq(
      (1, 12345678, "this is a test"),
      (1, 23456789, "another test"),
      (2, 2345678, "2nd test"),
      (2, 1234567, "2nd another test")
    ).toDF("id", "timestamp", "data")

    df.show

    +---+---------+----------------+
    | id|timestamp|            data|
    +---+---------+----------------+
    |  1| 12345678|  this is a test|
    |  1| 23456789|    another test|
    |  2|  2345678|        2nd test|
    |  2|  1234567|2nd another test|
    +---+---------+----------------+

    // Right side: one row per id holding its maximum timestamp.
    // Columns are renamed so the join condition is unambiguous.
    df.join(
      df.groupBy($"id")
        .agg(max($"timestamp") as "r_timestamp")
        .withColumnRenamed("id", "r_id"),
      $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
    ).drop("r_id").drop("r_timestamp").show

    +---+---------+------------+
    | id|timestamp|        data|
    +---+---------+------------+
    |  1| 23456789|another test|
    |  2|  2345678|    2nd test|
    +---+---------+------------+

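If an id can have several rows with the same timestamp (see comments below), the join would return every one of those tied rows. Drop the duplicates first, which keeps one arbitrary row per (id, timestamp) pair: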

    df.dropDuplicates(Seq("id", "timestamp")).join(
      df.groupBy($"id")
        .agg(max($"timestamp") as "r_timestamp")
        .withColumnRenamed("id", "r_id"),
      $"id" === $"r_id" && $"timestamp" === $"r_timestamp"
    ).drop("r_id").drop("r_timestamp").show

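For comparison, the same "latest row per id" result can probably also be obtained without a self-join by ranking rows inside each id with a window function. This is a different technique from the join above, shown only as a minimal sketch: the helper name `latestPerId`, the temporary column `rn`, and the local `SparkSession` setup are assumptions made for illustration, and it is meant to be pasted into a script or `spark-shell`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper (not part of the answer above): keep the row with the
// largest timestamp for each id, using a window instead of a self-join.
def latestPerId(df: DataFrame): DataFrame = {
  // Within each id, order rows from newest to oldest.
  val newestFirst = Window.partitionBy(col("id")).orderBy(col("timestamp").desc)
  df.withColumn("rn", row_number().over(newestFirst)) // 1 = newest row in its id
    .where(col("rn") === 1)
    .drop("rn")
}

// Assumed local session for trying this out; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("latest-per-id")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 12345678, "this is a test"),
  (1, 23456789, "another test"),
  (2, 2345678, "2nd test"),
  (2, 1234567, "2nd another test")
).toDF("id", "timestamp", "data")

latestPerId(df).show()
```

Because `row_number` assigns exactly one row the value 1 in each partition, tied timestamps need no separate `dropDuplicates` step, although which of the tied rows survives is arbitrary.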