How to sort RDD tuples with 5 elements in Spark Scala?

Suppose I have an RDD of 5-element tuples, e.g. RDD[(Double, String, Int, Double, Double)].

How can I sort this RDD efficiently by the fifth element?

I tried mapping this RDD to key-value pairs and using sortByKey, but sortByKey seems rather slow: it is slower than collecting the RDD and using sortWith on the collected array. Why is that?

Thank you very much.

+5
3 answers

You can do this with sortBy, which acts directly on the RDD:

 myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple 

There are also optional parameters for the sort order ("ascending") and the number of partitions.
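
As a minimal sketch of those optional parameters, assuming a local Spark setup and a small made-up dataset (the names `sc`, `myRdd`, `sortedDesc` and the sample values are illustrative, not from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in spark-shell, or creates a local one otherwise.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sortBy-example").setMaster("local[2]"))

// Hypothetical sample data of type (Double, String, Int, Double, Double).
val myRdd = sc.parallelize(Seq(
  (1.0, "a", 1, 0.5, 3.0),
  (2.0, "b", 2, 0.1, 1.0),
  (3.0, "c", 3, 0.9, 2.0)
))

// Sort by the 5th field in descending order, spreading the result over 4 partitions.
val sortedDesc = myRdd.sortBy(_._5, ascending = false, numPartitions = 4)

// collect() brings the sorted rows to the driver; fine here because the data is tiny.
val result = sortedDesc.collect()
result.foreach(println)
```

Both parameters have defaults (ascending order, and the partition count of the source RDD), so `myRdd.sortBy(_._5)` alone is enough for the common case.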

+8

sortByKey is simply the distributed sorting API in Spark 1.0.

How much data are you trying to sort? For a small amount, local/centralized sorting will be faster. If you are sorting gigabytes of data that might not even fit on a single node, that is where Spark shines.
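
To make the trade-off concrete, here is a rough sketch of the approaches being compared, assuming a local Spark setup and made-up sample data (`sc`, `rdd` and the result names are illustrative). It only shows the shape of each call and does not measure anything:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in spark-shell, or creates a local one otherwise.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("local-vs-distributed-sort").setMaster("local[2]"))

// Hypothetical sample data of type (Double, String, Int, Double, Double).
val rdd = sc.parallelize(Seq(
  (1.0, "a", 1, 0.5, 3.0),
  (2.0, "b", 2, 0.1, 1.0),
  (3.0, "c", 3, 0.9, 2.0)
))

// Local sort: pull everything to the driver, then sort the array in memory.
// Only viable when the whole dataset fits on the driver.
val sortedLocally = rdd.collect().sortWith(_._5 < _._5)

// Distributed sort via key-value pairs, as described in the question:
// key each tuple by its 5th field, sort by key, then drop the key again.
val sortedByKey = rdd.keyBy(_._5).sortByKey().values

// Distributed sort with sortBy; the result stays an RDD.
val sortedDistributed = rdd.sortBy(_._5)
```

The distributed variants involve a shuffle across partitions, which is overhead that a small dataset does not amortize. On very old Spark versions the pair-RDD methods may additionally need `import org.apache.spark.SparkContext._`.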

+1

If you want to sort in descending order and the element you sort by is of type Int, you can negate it with the "-" sign.

For example:

I have an RDD of (String, Int) tuples. To sort this RDD by its 2nd element in descending order:

 rdd.sortBy(x => -x._2).collect().foreach(println)

I have an RDD of (String, String) tuples. Negation does not work on strings, so to sort this RDD by its 2nd element in descending order, pass false for the ascending parameter:

 rdd.sortBy(x => x._2, false).collect().foreach(println)
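
As a small sketch putting the two variants side by side, assuming a local Spark setup and made-up data (`sc`, `pairs`, `names` and the values are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in spark-shell, or creates a local one otherwise.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("descending-sort").setMaster("local[2]"))

// Hypothetical (String, Int) data.
val pairs = sc.parallelize(Seq(("a", 3), ("b", 1), ("c", 2)))

// Variant 1: negate the numeric key.
val byNegation = pairs.sortBy(x => -x._2).collect()

// Variant 2: keep the key as-is and ask for descending order.
val byFlag = pairs.sortBy(_._2, ascending = false).collect()

// Hypothetical (String, String) data; only the flag variant applies here.
val names = sc.parallelize(Seq(("x", "beta"), ("y", "alpha"), ("z", "gamma")))
val namesDesc = names.sortBy(_._2, ascending = false).collect()
```

For this data both numeric variants give the same order. The flag variant is the safer default, since negating `Int.MinValue` overflows and it also works for non-numeric keys.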
+1

Source: https://habr.com/ru/post/1233581/

