How to combine two RDDs in Spark?

I have two JavaRDDs. The first is

JavaRDD<CustomClass> data

and the second is

JavaRDD<Vector> features

My custom class has two fields: a String text and an int label. I have 1000 instances of CustomClass in the JavaRDD data and 1000 instances of Vector in the JavaRDD features.

I computed these 1000 vectors from the JavaRDD data by applying a map function to it.

Now I want to get a new JavaRDD of the form

JavaRDD<LabeledPoint>

Since the LabeledPoint constructor requires both a label and a vector, I cannot write a single map function over one RDD whose call method receives both the CustomClass and the Vector, because call takes only one argument.
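Conceptually, what I want to build per element is something like this (a hypothetical snippet; getLabel() is just a getter I would add to CustomClass, and I don't yet know how to get hold of the matching vector):

    // label comes from CustomClass, the feature vector from the other RDD
    LabeledPoint point = new LabeledPoint(cd.getLabel(), featureVector);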

Can someone tell me how to combine these two JavaRDDs into a new JavaRDD<LabeledPoint>?

Here are some snippets of code that I wrote:

    class CustomClass implements Serializable {
        String text;
        int label;

        String getText() { return text; }
        int getLabel() { return label; }
    }

    JavaRDD<CustomClass> data = getDataFromFile(filename);

    final HashingTF hashingTF = new HashingTF();
    final IDF idf = new IDF();

    // term-frequency vector for each document
    final JavaRDD<Vector> td2 = data.map(
            new Function<CustomClass, Vector>() {
                @Override
                public Vector call(CustomClass cd) throws Exception {
                    return new DenseVector(hashingTF.transform(Arrays.asList(cd.getText().split(" "))).toArray());
                }
            }
    );

    // rescale the term frequencies by inverse document frequency
    final JavaRDD<Vector> features = idf.fit(td2).transform(td2);

Use JavaRDD#zip:

From its description: it zips one RDD with another, returning pairs with the first element of each RDD, the second element of each RDD, and so on. It assumes the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

JavaPairRDD<CustomClass,Vector> dataAndFeatures = data.zip(features);
// TODO dataAndFeatures.map to LabeledPoint instances

That requirement should hold here: td2 is a plain map over data, and df (== features?) is produced by the IDFModel's transform on td2, which also preserves the partitioning, so the elements of the two RDDs line up one to one.
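To fill in the TODO above, a minimal sketch of that second step could look like this (it assumes CustomClass exposes a getLabel() getter for the int field; adjust to however the label is actually stored):

    // needs: import scala.Tuple2;
    //        import org.apache.spark.mllib.regression.LabeledPoint;
    JavaRDD<LabeledPoint> labeledPoints = dataAndFeatures.map(
            new Function<Tuple2<CustomClass, Vector>, LabeledPoint>() {
                @Override
                public LabeledPoint call(Tuple2<CustomClass, Vector> pair) throws Exception {
                    // label from your class, feature vector from the zipped RDD
                    return new LabeledPoint(pair._1().getLabel(), pair._2());
                }
            }
    );

Because zip pairs elements positionally, each CustomClass ends up next to the vector that was computed from it.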

