How to get the n-th row of Spark RDD?

Question

Suppose I have an RDD of arbitrary objects. I want to get, say, the 10th row of the RDD. How can I do it? One way is to call rdd.take(n) and then pick the n-th element of the returned array, but this approach is slow when n is large.

+6

hadoop apache-spark rdd

user1742188 Jan 7 '15 at 18:30

2 answers

I have not tested this on huge data, but it works well for me.

Say n = 2, so I want to access the 2nd element:

  data.take(2).drop(1)

+2

Jack daniel Aug 23 '16 at 9:14

Nicola ferraro · Accepted Answer · 2015-01-07T18:48:20+0000

I don’t know how efficient this is, since it depends on current and future optimizations in the Spark engine, but you can try the following:

rdd.zipWithIndex.filter(_._2==9).map(_._1).first()

The first function, zipWithIndex, converts the RDD into (value, idx) pairs, with idx starting at 0. The second, filter, keeps only the element with idx == 9 (the 10th). The third, map, extracts the original value. Finally, first() returns the result.
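As a rough illustration, the pipeline in this answer could be wrapped in a small standalone program like the one below. This is only a sketch under some assumptions: the object name NthRow, the helper nth, the sample data and the local[*] master are all made up for the example and are not part of Spark or of the original answer. It also swaps first() for take(1).headOption, because first() throws on an empty RDD, which is what the filter produces when n is out of range.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object NthRow {
  // Hypothetical helper (not a Spark API): returns the element at zero-based
  // position n, or None if the RDD has fewer than n + 1 elements.
  def nth[T: ClassTag](rdd: RDD[T], n: Long): Option[T] =
    rdd.zipWithIndex()                        // RDD[(T, Long)], indices start at 0
      .filter { case (_, idx) => idx == n }   // keep only the wanted position
      .map { case (value, _) => value }       // drop the index again
      .take(1)                                // Array with zero or one element
      .headOption

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, only for trying the sketch out.
    val conf = new SparkConf().setAppName("nth-row").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
      println(nth(rdd, 2))   // Some(c)
      println(nth(rdd, 10))  // None
    } finally {
      sc.stop()
    }
  }
}
```

Note that zipWithIndex has to know the size of each earlier partition to assign indices, so on an RDD with more than one partition it runs an extra Spark job before the filter is evaluated.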

The execution engine may optimize how the first function is evaluated, which would affect the behavior of the whole pipeline. Give it a try.

In any case, if n is very large, this method is efficient in that it does not need to collect an array of the first n elements on the driver node.
