Is it possible to create nested RDDs in Apache Spark?

I am trying to implement the K-nearest neighbor algorithm in Spark. I was wondering if it is possible to work with nested RDDs. It will make my life easier. Consider the following code snippet.

public static void main(String[] args) {
    // blah blah code
    JavaRDD<Double> temp1 = testData.map(
        new Function<Vector, Double>() {
            public Double call(final Vector z) throws Exception {
                JavaRDD<Double> temp2 = trainData.map(
                    new Function<Vector, Double>() {
                        public Double call(Vector vector) throws Exception {
                            return (double) vector.length();
                        }
                    }
                );
                return (double) z.length();
            }
        }
    );
}

I am currently getting an error with this setup (I can post the full log here). Is this allowed in the first place? Thanks

2 answers

No, this is not possible, because the elements of an RDD must be serializable and an RDD is not itself serializable. And that makes sense: otherwise you could send an entire RDD over the network, which is a problem if it contains a lot of data. And if it does not contain a lot of data, you might as well use an array or something like that.

However, I don't know how you are implementing k-nearest neighbors... but be careful: if you do something like computing the distance between every pair of points, it does not actually scale with the size of the data set, because it is O(n²).
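For what it's worth, here is a minimal sketch of the "use an array instead" idea, in the spirit of the question's Java code: collect the (hopefully small) training set on the driver, broadcast it, and do the distance loop inside a single map over the test set. The names trainData/testData follow the question; the use of double[] as the point type and the whole helper class are my own assumptions, not code from either poster.

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;

public class NearestNeighborSketch {
    // For each test point, return the distance to its nearest training point,
    // without ever nesting one RDD inside another.
    public static JavaRDD<Double> nearestDistances(JavaSparkContext sc,
                                                   JavaRDD<double[]> trainData,
                                                   JavaRDD<double[]> testData) {
        // Collect the training set on the driver and ship it to the executors once.
        final Broadcast<List<double[]>> train = sc.broadcast(trainData.collect());

        return testData.map(new Function<double[], Double>() {
            public Double call(double[] z) {
                double best = Double.MAX_VALUE;
                for (double[] t : train.value()) {
                    double sum = 0.0;
                    for (int i = 0; i < z.length; i++) {
                        double d = z[i] - t[i];
                        sum += d * d;
                    }
                    best = Math.min(best, Math.sqrt(sum)); // Euclidean distance
                }
                return best;
            }
        });
    }
}

This only works when the training set fits in memory on each executor; if it does not, the usual approach is a join-based formulation rather than broadcasting.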


I came across a NullPointerException when trying something like this, since we cannot perform operations on an RDD from inside another RDD.

Spark does not support nested RDDs. The reason is that performing an operation on an RDD or creating a new one requires the Spark runtime to access the SparkContext object, which is available only on the driver machine.

Therefore, if you want something like nested RDDs, you can collect the parent RDD on the driver node and then iterate over it using an array or something like that, as sketched below.
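A rough illustration of that pattern, using the same assumed names (trainData, testData) and double[] points as the sketch in the first answer; this is only an illustration, not the answerer's code. The outer loop runs on the driver, where the SparkContext lives, so each inner map/reduce is an ordinary Spark job.

// Same imports as above, plus org.apache.spark.api.java.function.Function2.
List<double[]> testPoints = testData.collect();   // parent RDD collected on the driver
for (final double[] z : testPoints) {
    // Launched from the driver, so this is a normal (non-nested) RDD operation.
    double nearest = trainData
        .map(new Function<double[], Double>() {
            public Double call(double[] t) {
                double sum = 0.0;
                for (int i = 0; i < z.length; i++) {
                    double d = z[i] - t[i];
                    sum += d * d;
                }
                return Math.sqrt(sum);
            }
        })
        .reduce(new Function2<Double, Double, Double>() {
            public Double call(Double a, Double b) {
                return Math.min(a, b);
            }
        });
    System.out.println("nearest training distance: " + nearest);
}

Note that this launches one Spark job per collected element, which is fine for a handful of points but slow for a large test set; in that case prefer the broadcast approach above.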

Note: the RDD class itself is declared Serializable; see the class signature below.

(Screenshot of the RDD class declaration, which extends Serializable.)


Source: https://habr.com/ru/post/1012119/

