How to create a Spark DataFrame from a SequenceFile

I work with Spark 1.5. I want to create a DataFrame from files stored in HDFS. The files are sequence files whose values contain JSON data with a large number of fields.

Is there any way to do this elegantly in Java? I don't know the JSON structure/fields in advance.

I can read the sequence file into an RDD as follows:

JavaPairRDD<LongWritable,BytesWritable> inputRDD = jsc.sequenceFile("s3n://key_id:secret_key@file/path", LongWritable.class, BytesWritable.class);
JavaRDD<String> events = inputRDD.map(
    new Function<Tuple2<LongWritable, BytesWritable>, String>() {
        public String call(Tuple2<LongWritable, BytesWritable> tuple) throws Exception {
            // getBytes() can return a padded buffer, so decode only getLength() bytes
            return Text.decode(tuple._2.getBytes(), 0, tuple._2.getLength());
        }
    }
);

How can I create a DataFrame from this RDD?

1 answer

I did the following for JSON data in my sequence files:

JavaRDD<String> events = inputRDD.map(
    new Function<Tuple2<LongWritable, BytesWritable>, String>() {
        public String call(Tuple2<LongWritable, BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
            // Decode only the valid bytes; getBytes() may return a padded buffer
            String valueAsString = new String(tuple._2.getBytes(), 0, tuple._2.getLength(), "UTF-8");
            // Parse with org.json and extract the nested "payload" object
            JSONObject data = new JSONObject(valueAsString);
            JSONObject payload = new JSONObject(data.getString("payload"));
            return payload.toString();
        }
    }
);
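
To get the DataFrame itself, you can then let Spark SQL infer the schema from the JSON strings, which works even when the fields are not known in advance. A minimal sketch for Spark 1.5, assuming jsc is your JavaSparkContext and each record in the events RDD is one JSON document:

// Infer the schema from the JSON strings (Spark 1.5 API)
SQLContext sqlContext = new SQLContext(jsc);
DataFrame df = sqlContext.read().json(events);
df.printSchema();                 // inspect the inferred schema
df.registerTempTable("events");   // optional: query the data with SQL

Schema inference scans the JSON, so for very large inputs you may want to sample or provide an explicit schema instead.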
