How to get the n-th row of a Spark RDD?

Suppose I have an RDD of arbitrary objects. I want to get the 10th (say) row of the RDD. How can I do it? One way is to use rdd.take(n) and then access the n-th element of the resulting array, but this approach is slow when n is large.

2 answers

I don’t know how efficient this is, as it depends on current and future optimizations in the Spark engine, but you can try the following:

rdd.zipWithIndex.filter(_._2==9).map(_._1).first() 

The first function, zipWithIndex, converts the RDD into an RDD of (value, idx) pairs with idx starting at 0. The second function, filter, keeps only the element with idx == 9 (the 10th). The third function, map, extracts the original value. Then first() returns the result.
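The same pipeline can be wrapped in a small helper. This is only a sketch: the names NthRow and nth, the local SparkSession, and the sample data are assumptions made for illustration, not part of the original answer. It uses take(1).headOption instead of first(), because first() throws on an empty RDD, which is what the filter produces when n is past the end.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

object NthRow {
  // Hypothetical helper: element at 0-based position n, or None if the RDD is shorter.
  def nth[T: ClassTag](rdd: RDD[T], n: Long): Option[T] =
    rdd.zipWithIndex()                        // RDD[(T, Long)], indices start at 0
      .filter { case (_, idx) => idx == n }   // keep only the requested position
      .map { case (value, _) => value }       // drop the index again
      .take(1)                                // at most one element reaches the driver
      .headOption

  def main(args: Array[String]): Unit = {
    // Local session used only for this illustration.
    val spark = SparkSession.builder().master("local[2]").appName("nth-row").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4).map(i => s"row-$i")

    println(nth(rdd, 9))    // Some(row-10), the 10th row
    println(nth(rdd, 500))  // None, the RDD has only 100 rows

    spark.stop()
  }
}
```

Note that zipWithIndex indexes elements by partition order and then by position within each partition, so "the n-th row" is only meaningful if the RDD has a stable ordering.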

The first() call may be pulled up by the execution engine and influence how the whole pipeline is processed. Give it a try.

In any case, if n is very large, this method is efficient in that it does not need to collect an array of the first n elements on the driver node.


I have not tested this on huge data, but it works well for me.

Let's say n = 2 and I want to access the 2nd element:

  data.take(2).drop(1).head
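The take-based approach can be generalized to any n along the following lines. This is a sketch under assumptions: the names NthViaTake and nthViaTake, the 1-based n, and the local SparkSession are all illustrative. Keep in mind that take(n) brings the first n elements to the driver, which is the cost the question is concerned about for large n.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object NthViaTake {
  // Hypothetical helper; n is 1-based here, matching "the 2nd element" above.
  def nthViaTake[T](data: RDD[T], n: Int): Option[T] = {
    require(n >= 1, "n is 1-based and must be at least 1")
    data.take(n)      // Array with up to n elements, materialized on the driver
      .drop(n - 1)    // skip the first n - 1 of them
      .headOption     // None if the RDD has fewer than n elements
  }

  def main(args: Array[String]): Unit = {
    // Local session used only for this illustration.
    val spark = SparkSession.builder().master("local[2]").appName("nth-via-take").getOrCreate()
    val data = spark.sparkContext.parallelize(Seq("a", "b", "c"))

    println(nthViaTake(data, 2))  // Some(b)
    println(nthViaTake(data, 5))  // None

    spark.stop()
  }
}
```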

Source: https://habr.com/ru/post/980667/

