I have a Dataset<Row> with three columns, and I want to transform it so that I can fit a linear regression.
My columns are [id, x, y], and I want a separate linear regression for each id.
For instance:
[1 , 1005, 0.29]
[1 , 1006, 0.46]
[1 , 1007, 0.29]
[2 , 1000, 0.68]
[2 , 1010, 0.50]
How can I create LabeledPoint objects from this data?
Do I need my data in this form?
(0.29, (1, [1005,1007]))
(0.46, (1, [1006]))
(0.68, (2, [1000]))
(0.50, (2, [1010]))
I know how to get as far as this:
JavaRDD<Row> datardd = dataset.toJavaRDD();
// Key each row by id; the value is the (x, y) pair.
JavaPairRDD<Integer, Tuple2<Double, Double>> datapairrdd =
    datardd.mapToPair(new PairFunction<Row, Integer, Tuple2<Double, Double>>() {
        @Override
        public Tuple2<Integer, Tuple2<Double, Double>> call(Row row) throws Exception {
            return new Tuple2<>(
                Integer.valueOf(row.getString(0)),
                new Tuple2<>(Double.valueOf(row.getString(1)), Double.valueOf(row.getString(2))));
        }
    });
JavaPairRDD<Integer, Iterable<Tuple2<Double, Double>>> data = datapairrdd.groupByKey();
So my data is now:
(1, [(1005,0.29), (1006,0.46), (1007, 0.29)])
(2, [(1000,0.68), (1010,0.50)])
But I'm stuck from here.
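To make the goal concrete, here is a minimal sketch in plain Java (no Spark) of the result I'm after for each id: an ordinary least-squares fit of y on x over that id's points. The class name PerIdRegression and the methods fitPerId and fit are hypothetical names made up for this illustration. It uses the closed-form formulas for a single feature, not Spark's LabeledPoint or LinearRegression, and it assumes each id has at least two distinct x values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only: not part of Spark.
final class PerIdRegression {

    // Takes rows of {id, x, y}, groups them by id (the local equivalent of
    // groupByKey()), and fits y = slope * x + intercept for each group.
    // Returns id -> {slope, intercept}.
    static Map<Integer, double[]> fitPerId(double[][] rows) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (double[] row : rows) {
            int id = (int) row[0];
            groups.computeIfAbsent(id, k -> new ArrayList<>())
                  .add(new double[] {row[1], row[2]});
        }
        Map<Integer, double[]> models = new TreeMap<>();
        for (Map.Entry<Integer, List<double[]>> entry : groups.entrySet()) {
            models.put(entry.getKey(), fit(entry.getValue()));
        }
        return models;
    }

    // Closed-form simple linear regression over a list of {x, y} points.
    static double[] fit(List<double[]> points) {
        double meanX = 0.0;
        double meanY = 0.0;
        for (double[] p : points) {
            meanX += p[0];
            meanY += p[1];
        }
        meanX /= points.size();
        meanY /= points.size();

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (double[] p : points) {
            sxy += (p[0] - meanX) * (p[1] - meanY);
            sxx += (p[0] - meanX) * (p[0] - meanX);
        }
        if (sxx == 0.0) {
            // Assumption stated above: a slope is undefined without two distinct x values.
            throw new IllegalArgumentException("need at least two distinct x values per id");
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        // The sample rows from the question.
        double[][] rows = {
            {1, 1005, 0.29}, {1, 1006, 0.46}, {1, 1007, 0.29},
            {2, 1000, 0.68}, {2, 1010, 0.50}
        };
        for (Map.Entry<Integer, double[]> e : fitPerId(rows).entrySet()) {
            System.out.printf("id=%d slope=%.4f intercept=%.4f%n",
                    e.getKey(), e.getValue()[0], e.getValue()[1]);
        }
    }
}
```

For the sample data this gives a slope of about -0.018 and an intercept of about 18.68 for id 2, and a flat line (slope 0) at roughly 0.347 for id 1. What I don't know is how to get the same per-id result out of the grouped RDD above using LabeledPoint, where I assume the label would be y and the feature vector would hold the single value x.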