Operate on neighboring elements of an RDD in Spark

Suppose I have a collection:

List(1, 3,-1, 0, 2, -4, 6)

It is easy to sort it:

List(-4, -1, 0, 1, 2, 3, 6)

Then I can build a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on:

for (i <- 0 to list.length - 2) yield {
  list(i + 1) - list(i)
}

and get the vector:

Vector(3, 1, 1, 1, 1, 3)

That is, I want to subtract each element from the element that follows it.

But how can I implement this on a Spark RDD?

I know that for the collection:

List(-4, -1, 0, 1, 2, 3, 6)

the RDD will be split into several sections, each of which is ordered. Can I perform a similar operation on each section and then collect the per-section results together? A rough sketch of what I have in mind follows.
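To make the idea concrete, here is a rough sketch (I am assuming that sortBy range-partitions the data, so each partition is a contiguous, ordered section, and that sc is an existing SparkContext):

// Differences between neighbors inside each section (partition).
val sorted = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)

val inner = sorted.mapPartitions { it =>
  it.sliding(2).withPartial(false).map { case Seq(a, b) => b - a }
}

// Differences across section boundaries, from each partition's first and last element.
val borders = sorted.mapPartitionsWithIndex { (i, it) =>
  val v = it.toVector
  if (v.isEmpty) Iterator.empty else Iterator((i, v.head, v.last))
}.collect().sortBy(_._1)

val boundary = borders.sliding(2).collect {
  case Array((_, _, lastA), (_, firstB, _)) => firstB - lastA
}.toSeq

// inner.collect() ++ boundary holds all the differences, though not in positional order.

But this boundary patching feels clumsy, so I wonder whether there is a cleaner way.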

2 answers

The most efficient solution is to use the sliding method from spark-mllib's RDDFunctions:

import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)                  // sort ascending
  .sliding(2)                        // windows of two consecutive elements
  .map { case Array(x, y) => y - x } // next element minus current
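
Collecting the result gives the expected differences (a quick check, assuming sc is an existing SparkContext):

rdd.collect()
// Array(3, 1, 1, 1, 1, 3)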

Suppose you have something like

val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)

First, create a version of the collection keyed by index:

val original = seq.zipWithIndex.map(_.swap)

Now build the same collection with every index shifted down by one:

val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)

Then join the two and compute the differences:

val diffs = original.join(shifted)
      .sortBy(_._1, ascending = false)
      .map { case (idx, (v1, v2)) => v2 - v1 }

So

println(diffs.collect.toSeq)

prints

WrappedArray(3, 1, 1, 1, 1, 3)
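
To see where these values come from, it may help to trace the intermediate contents for this input (an illustrative trace written out by hand, not Spark output):

// original: (0,-4), (1,-1), (2,0), (3,1), (4,2), (5,3), (6,6)
// shifted:  (0,-1), (1,0), (2,1), (3,2), (4,3), (5,6)   -- (-1,-4) was filtered out
// joined:   (0,(-4,-1)), (1,(-1,0)), (2,(0,1)), (3,(1,2)), (4,(2,3)), (5,(3,6))
// v2 - v1:  3, 1, 1, 1, 1, 3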

Note that you can skip the sortBy step if getting the results in reversed order is not important to you.

Also note that for a local collection this could be computed much more simply:

val elems = List(1, 3, -1, 0, 2, -4, 6).sorted  

(elems.tail, elems).zipped.map(_ - _).reverse
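
The same local result, already in the right order, can also be had with the standard sliding method on Scala collections, mirroring the sliding answer above:

val elems = List(1, 3, -1, 0, 2, -4, 6).sorted
elems.sliding(2).map { case Seq(a, b) => b - a }.toVector
// Vector(3, 1, 1, 1, 1, 3)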

For an RDD, however, the zip method requires that both collections have the same number of partitions and the same number of elements in each partition. So if you implemented tail as

val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)  

then tail.zip(seq) would not work correctly, since the two collections contain different numbers of elements per partition, so zip is not applicable.
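
For completeness, a hypothetical way to make zip applicable is to trim seq to the same length and collapse both sides into a single partition, which sacrifices all parallelism (this sketch assumes coalesce(1) without shuffle preserves the sorted order):

// Align element counts, then force one partition each so zip's requirement holds.
val n = seq.count()
val init = seq.zipWithIndex().filter(_._2 < n - 1).map(_._1) // all but the last element
val diffs2 = tail.coalesce(1).zip(init.coalesce(1)).map { case (t, s) => t - s }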


Source: https://habr.com/ru/post/1618981/

