Parse a JSON column of a Dataset<Row> into a Dataset<Row>
I have a Dataset<Row> with a single JSON string column:
+--------------------+
|               value|
+--------------------+
|{"Context":"00AA0...|
+--------------------+

JSON example:
{"Context":"00AA00AA","MessageType":"1010","Module":"1200"} How can I most effectively get a Dataset<Row> that looks like this:
+--------+-----------+------+
| Context|MessageType|Module|
+--------+-----------+------+
|00AA00AA|       1010|  1200|
+--------+-----------+------+

I process this data in a stream, and I know that Spark can do this itself when reading from a file:
spark
  .readStream()
  .schema(MyPojo.getSchema())
  .json("src/myinput")

but now I am reading the data from Kafka, which delivers it in a different form. I know I could use a parser such as Gson, but I would like Spark to do this for me.
1 Answer
Try this sample.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkJSONValueDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("SparkJSONValueDataset")
            .config("spark.sql.warehouse.dir", "file:///C:/temp")
            .master("local")
            .getOrCreate();

        // Prepare a Dataset<Row> with a single JSON string column named "value"
        List<String> data = Arrays.asList(
            "{\"Context\":\"00AA00AA\",\"MessageType\":\"1010\",\"Module\":\"1200\"}");
        Dataset<Row> df = spark.createDataset(data, Encoders.STRING())
            .toDF().withColumnRenamed("_1", "value");
        df.show();

        // Convert to Dataset<String> and let Spark infer the JSON schema
        Dataset<String> df1 = df.as(Encoders.STRING());
        Dataset<Row> df2 = spark.read().json(df1.javaRDD());
        df2.show();

        spark.stop();
    }
}
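If you are on Spark 2.2 or later, you can also pass the Dataset<String> directly instead of going through a JavaRDD (the RDD overload has since been deprecated):

Dataset<Row> df2 = spark.read().json(df1);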
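Since the question reads from Kafka as a stream, where spark.read() is not available, parsing the value column with from_json works on a streaming Dataset as well. Below is a minimal sketch, assuming Spark 2.1+ with the spark-sql-kafka-0-10 connector on the classpath; the broker address and topic name are placeholders:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class KafkaJsonStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("KafkaJsonStream")
            .master("local")
            .getOrCreate();

        // Schema matching the JSON payload from the question
        StructType schema = new StructType()
            .add("Context", DataTypes.StringType)
            .add("MessageType", DataTypes.StringType)
            .add("Module", DataTypes.StringType);

        // Kafka source; broker and topic are placeholder values
        Dataset<Row> kafka = spark
            .readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "mytopic")
            .load();

        // Kafka's "value" column is binary: cast it to a string,
        // parse it with from_json, then flatten the struct into columns
        Dataset<Row> parsed = kafka
            .selectExpr("CAST(value AS STRING) AS value")
            .select(from_json(col("value"), schema).as("json"))
            .select("json.*");

        parsed.writeStream()
            .format("console")
            .start()
            .awaitTermination();
    }
}

Unlike spark.read().json(...), from_json requires an explicit schema, because a streaming query cannot scan the data up front to infer one.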