How to encode a data set for linear regression in Spark with Java?

I have a Dataset<Row> with three columns, and I want to transform it so that I can fit a linear regression.

My columns are [id, x, y], and I want a separate linear regression for each id.

For instance:

[1 , 1005, 0.29]   
[1 , 1006, 0.46]  
[1 , 1007, 0.29]
[2 , 1000, 0.68]
[2 , 1010, 0.50]

How can I create LabeledPoint from this data?

Does my data need to look like this?

(0.29, (1, [1005,1007]))
(0.46, (1, [1006]))
(0.68, (2, [1000]))
(0.50, (2, [1010]))
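
For comparison, here is a minimal plain-Java sketch of one possible target shape, assuming y is the label and x is the only feature. Every row becomes its own sample, so two rows that share a label (such as the two 0.29 rows for id 1) stay separate instead of being merged into one feature vector. Sample and PerIdSamples are hypothetical stand-ins introduced for illustration; in Spark MLlib the same role is played by LabeledPoint, which pairs a double label with a feature Vector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PerIdSamples {

    /** Hypothetical stand-in for a labeled point: a label (y) plus its feature vector ([x]). */
    static final class Sample {
        final double label;
        final double[] features;

        Sample(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /** Groups rows of {id, x, y} by id; each row yields exactly one Sample. */
    static Map<Integer, List<Sample>> groupById(double[][] rows) {
        Map<Integer, List<Sample>> byId = new TreeMap<>();
        for (double[] row : rows) {
            int id = (int) row[0];
            byId.computeIfAbsent(id, k -> new ArrayList<>())
                .add(new Sample(row[2], new double[] {row[1]}));
        }
        return byId;
    }

    public static void main(String[] args) {
        double[][] rows = {
            {1, 1005, 0.29}, {1, 1006, 0.46}, {1, 1007, 0.29},
            {2, 1000, 0.68}, {2, 1010, 0.50}
        };
        // Print one line per sample: id, label, single feature.
        for (Map.Entry<Integer, List<Sample>> e : groupById(rows).entrySet()) {
            for (Sample s : e.getValue()) {
                System.out.println(e.getKey() + " -> (" + s.label + ", [" + s.features[0] + "])");
            }
        }
    }
}
```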

I know how to get as far as this:

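import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// Key each row by id with (x, y) as the value; getString assumes all three columns are stored as strings.
JavaRDD<Row> datardd = dataset.toJavaRDD();
JavaPairRDD<Integer, Tuple2<Double, Double>> datapairrdd =
        datardd.mapToPair(new PairFunction<Row, Integer, Tuple2<Double, Double>>() {
            @Override
            public Tuple2<Integer, Tuple2<Double, Double>> call(Row row) throws Exception {
                return new Tuple2<>(Integer.valueOf(row.getString(0)), new Tuple2<>(Double.valueOf(row.getString(1)), Double.valueOf(row.getString(2))));
            }
        });
// Collect all (x, y) pairs that share an id.
JavaPairRDD<Integer, Iterable<Tuple2<Double, Double>>> data = datapairrdd.groupByKey();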

So my data is now:

    (1, [(1005,0.29), (1006,0.46), (1007, 0.29)])
    (2, [(1000,0.68), (1010,0.50)])

But I'm stuck from here.
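
One way to picture the remaining step: with a single feature x, a per-id fit of y = slope * x + intercept has a closed-form ordinary least squares solution, so each group of (x, y) pairs can be reduced to two numbers. The sketch below does this in plain Java on already-grouped data. It is not Spark MLlib's LinearRegression; PerIdRegression, fit, and fitAll are hypothetical names, and it assumes each id has at least two distinct x values.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PerIdRegression {

    /** Ordinary least squares for one feature; each point is {x, y}. Returns {slope, intercept}. */
    static double[] fit(List<double[]> points) {
        int n = points.size();
        if (n < 2) {
            throw new IllegalArgumentException("need at least two points");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (double[] p : points) {
            meanX += p[0];
            meanY += p[1];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (double[] p : points) {
            double dx = p[0] - meanX;
            sxy += dx * (p[1] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = sxy / sxx;
        return new double[] {slope, meanY - slope * meanX};
    }

    /** Fits one independent line per id. */
    static Map<Integer, double[]> fitAll(Map<Integer, List<double[]>> grouped) {
        Map<Integer, double[]> models = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<double[]>> e : grouped.entrySet()) {
            models.put(e.getKey(), fit(e.getValue()));
        }
        return models;
    }

    public static void main(String[] args) {
        // Same grouped data as in the question: id -> list of (x, y).
        Map<Integer, List<double[]>> grouped = new LinkedHashMap<>();
        grouped.put(1, Arrays.asList(
                new double[] {1005, 0.29}, new double[] {1006, 0.46}, new double[] {1007, 0.29}));
        grouped.put(2, Arrays.asList(
                new double[] {1000, 0.68}, new double[] {1010, 0.50}));

        for (Map.Entry<Integer, double[]> e : fitAll(grouped).entrySet()) {
            System.out.println("id " + e.getKey()
                    + ": slope=" + e.getValue()[0] + ", intercept=" + e.getValue()[1]);
        }
    }
}
```

In a Spark job the same per-group function could be applied to the values produced by groupByKey, which works as long as each group is small enough to fit in memory on one executor.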

Source: https://habr.com/ru/post/1650895/

