Unfortunately, both approaches are flawed before you even start sorting, and there is no efficient, simple way to do this in Spark. Still, the first one is significantly worse than the other.
Why are both approaches flawed? Because each is just another groupByKey, and that is simply an expensive operation. There are ways to improve things (in particular by avoiding map-side reduction), but at the end of the day you still have to pay the price of a full shuffle, and if you are not seeing failures it is probably not worth the fuss.
Still, the second approach is much better algorithmically *. If you want to keep a sorted structure all the way through, as in the first attempt, you would need dedicated tools ( aggregateByKey with bisect.insort would be a good choice), but there is really nothing to gain here.
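To illustrate the per-key sorted-insert pattern that aggregateByKey with bisect.insort would follow, here is a minimal pure-Python sketch (the key names and sample values are invented for illustration; a real RDD version would pass the empty list and insort-based functions to aggregateByKey):

```python
from bisect import insort

# Simulated (key, value) records, a local stand-in for an RDD.
records = [("JFK", 3), ("LAX", 2), ("JFK", 1), ("LAX", 5), ("JFK", 2)]

# Mimic aggregateByKey's sequence operation: insert each value into a
# per-key list while keeping that list sorted at every step.
groups = {}
for key, value in records:
    insort(groups.setdefault(key, []), value)

print(groups)  # {'JFK': [1, 2, 3], 'LAX': [2, 5]}
```

Each insort is O(N) because of the list shift, which is exactly why maintaining sorted order incrementally buys nothing over sorting each group once at the end.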
If grouped output is a hard requirement, the best thing you can do is keyBy , groupByKey and sort. This will not improve performance over the second solution, but arguably improves readability:
(flights
    .keyBy(lambda x: x[5])
    .groupByKey()
    .mapValues(lambda vs: sorted(vs, key=lambda x: x[1:5])))
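The same group-then-sort semantics can be checked locally without a Spark cluster. This sketch uses invented flight tuples shaped like the ones the snippet assumes (index 5 is the grouping key, the slice [1:5] the sort key); the field layout and values are assumptions for illustration only:

```python
from collections import defaultdict

# Invented rows: (id, year, month, day, hour, origin).
flights = [
    ("f1", 2015, 3, 14, 9, "JFK"),
    ("f2", 2015, 1, 2, 7, "JFK"),
    ("f3", 2015, 1, 2, 6, "LAX"),
    ("f4", 2014, 12, 31, 23, "LAX"),
]

# Local equivalent of keyBy(lambda x: x[5]) + groupByKey()
grouped = defaultdict(list)
for row in flights:
    grouped[row[5]].append(row)

# Local equivalent of mapValues(sorted(..., key=lambda x: x[1:5]))
result = {k: sorted(vs, key=lambda x: x[1:5]) for k, vs in grouped.items()}

print(result["JFK"][0][0])  # f2 sorts first within the JFK group
```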
* Even if you assume the best-case scenario for Timsort, the first approach performs N insertions at O(N) each, i.e. O(N²) overall, while the second is O(N log N) even in the worst case.