I work with Spark 1.5 and want to create a DataFrame from files in HDFS. The files are sequence files whose values contain JSON records with a large number of fields.
Is there an elegant way to do this in Java? I don't know the JSON structure/fields in advance.
I can read the sequence file into an RDD as follows:
JavaPairRDD<LongWritable, BytesWritable> inputRDD =
    jsc.sequenceFile("s3n://key_id:secret_key@file/path",
        LongWritable.class, BytesWritable.class);

JavaRDD<String> events = inputRDD.map(
    new Function<Tuple2<LongWritable, BytesWritable>, String>() {
        public String call(Tuple2<LongWritable, BytesWritable> tuple) {
            return Text.decode(tuple._2.getBytes());
        }
    }
);
How can I create a DataFrame from this RDD?
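For what it's worth, one approach I have been considering: in Spark 1.5, `SQLContext.read().json(...)` accepts a `JavaRDD<String>` of JSON documents and infers the schema by sampling the data, so the fields would not need to be known in advance. A minimal sketch, assuming `jsc` and `events` are the variables from the snippet above:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Assumes `jsc` (JavaSparkContext) and `events` (JavaRDD<String> of JSON
// strings) are already defined as in the code above.
SQLContext sqlContext = new SQLContext(jsc);

// json(JavaRDD<String>) scans the records to infer a schema covering
// all fields seen in the data.
DataFrame df = sqlContext.read().json(events);

df.printSchema();                  // inspect the inferred fields
df.registerTempTable("events");    // optionally query with Spark SQL
```

I am unsure whether schema inference over a very large number of fields is efficient here, which is part of why I am asking.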