How to turn a well-known structured RDD into a vector

Question

How to turn a well-known structured RDD into a vector

Assuming I have an RDD containing (Int, Int) tuples. I want to turn it into a vector, where the first Int in the tuple is the index and the second is the value.

Any idea how I can do this?

I am updating my question and adding my solution to clarify: My RDD is already reduced by key, and the number of keys is known. I want a vector for updating a single battery instead of multiple batteries.

There for my final decision was:

reducedStream.foreachRDD(rdd => rdd.collect({case (x: Int,y: Int) => { val v = Array(0,0,0,0) v(x) = y accumulator += new Vector(v) }}))

Using Vector from the battery example in the documentation.

+5

scala vector distributed-computing apache-spark rdd

Noam shaish Dec 18 '14 at 21:00

source share

2 answers

One key thing: do you really need a vector? A map could be much more suitable.

If you really need a local vector, you first need to use .collect (), and then do local conversions to Vector. Of course, you will have enough memory for this. But here the real problem is to find a vector that can be built efficiently from pairs (index, value). As far as I know, Spark MLLib has an org.apache.spark.mllib.linalg.Vectors class that can create Vector from an array of indexes and values, and even from tuples. Under the hood, breeze.linalg used. So you probably should have started.
If you just need an order, you can use .orderByKey() since you already have RDD[(K,V)] . So you ordered the stream. Which does not strictly monitor your intentions, but perhaps this may come even better. Now you can discard elements with the same .reduceByKey() key, creating only the resulting elements.
Finally, if you really need a large vector, make a .orderByKey , and then you can create a real vector by executing .flatmap() , which maintains a counter and discards more than one element for the same index / insert the required number of 'default' elements for missing indexes.

Hope this is clear enough.

+3

Roman Nikitchenko Dec 18 '14 at 22:20

source share

The archetypal paul · Accepted Answer · 2014-12-18T21:54:24+0000

 rdd.collectAsMap.foldLeft(Vector[Int]()){case (acc, (k,v)) => acc updated (k, v)}

Turn the RDD onto the map. Then repeat this, creating a vector as we go.

You can use justt collect (), but if there are many repetitions of tuples with the same key, which may not fit in memory.

How to turn a well-known structured RDD into a vector

More articles: