How to temporarily save a column as a json object to output other columns?

I have a data set with key pairs of values ​​like this

likes=dogs;hates=birds;likes=sports;eats=cheese 

Then i will turn it into json

 {"likes": ["dogs","sports"], "hates": ["birds"], "eats": ["cheese"]} 

Is there a way to save this json data structure without dropping it onto a row, so I can get more columns from it row by row? I would like it to look something like this, without having to decrypt json from the row in each column.

  Dataset<Row> df1 = df.withColumn("interests", callUDF("to_json", col("interests"))) .withColumn("likes", callUDF("extract_from_json", "likes", col("interests"))) .withColumn("hates", callUDF("extract_from_json", "hates", col("interests"))) .withColumn("hates", callUDF("extract_from_json", "eats", col("interests"))); 
+5
source share
1 answer

If you are working with the source file

 likes=dogs;hates=birds;likes=sports;eats=cheese 

then you can read it with sc.textFile, then do some simple manipulations with RDD.

 val df = sc.textFile(file) .flatMap(x => x.split(";")) .map(x => (x.split("=")(0), x.split("=")(1))) .toDF("interest","value") df.withColumn("tmp",lit(1)).groupBy("tmp").pivot("interest").agg(collect_list("value")) +---+--------+-------+--------------+ |tmp| eats| hates| likes| +---+--------+-------+--------------+ | 1|[cheese]|[birds]|[dogs, sports]| +---+--------+-------+--------------+ 
+3
source

Source: https://habr.com/ru/post/1273015/


All Articles