You can achieve this in several ways.
When reading, you can either provide a schema for the JSON data yourself, or let Spark infer the schema on its own.
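For example, a minimal sketch of both options (assuming a Spark 1.x SQLContext named sqlContext and a hypothetical people.json file with name, age, address, and phones fields):

import org.apache.spark.sql.types._

// Option 1: let Spark infer the schema by sampling the JSON
val inferredDF = sqlContext.read.json("people.json")

// Option 2: provide the schema up front, which skips the inference pass
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("address", StructType(Seq(
    StructField("city", StringType),
    StructField("zip", StringType)))),
  StructField("phones", ArrayType(StringType))))
val explicitDF = sqlContext.read.schema(schema).json("people.json")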
Once the JSON is in the DataFrame, you can flatten it in the following ways:
a. Using explode() on the DataFrame to flatten array columns.
b. Using Spark SQL and accessing nested fields with the . operator.
Both approaches are sketched below.
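Continuing with the hypothetical schema from the previous sketch (the people table name is an assumption):

import org.apache.spark.sql.functions.explode

// a. explode() produces one output row per element of the phones array
val flattened = explicitDF.select(explicitDF("name"), explode(explicitDF("phones")).as("phone"))

// b. nested struct fields are reachable with the . operator,
// either through the DataFrame API or through Spark SQL
val cities = explicitDF.select(explicitDF("address.city"))
explicitDF.registerTempTable("people")
val citiesViaSql = sqlContext.sql("SELECT name, address.city AS city FROM people")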
Finally, if you want to add new columns to the DataFrame:
a. The first option is withColumn(). However, it is executed once for each added column, each time over the entire dataset.
b. Using SQL to generate a new DataFrame from the existing one; this may be the easiest.
c. Using map: access the row elements, take the old schema, add the new values, create a new schema, and finally build a new DataFrame, as shown below.
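A rough sketch of options a and b first, reusing the people temp table registered above (the source and isAdult column names are made up for illustration):

import org.apache.spark.sql.functions.lit

// a. withColumn(): each call is a separate projection over the whole dataset
val step1 = explicitDF.withColumn("source", lit("json"))
val step2 = step1.withColumn("isAdult", step1("age") >= 18)

// b. SQL can derive all the new columns in a single statement
val viaSql = sqlContext.sql("SELECT *, 'json' AS source, age >= 18 AS isAdult FROM people")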
A single withColumn call operates over the entire RDD, so calling it once per column you want to add is usually not recommended. Instead, you can work with the columns and their data inside a map function. Because only one map function is used, the code that adds the new columns and their data runs in parallel in a single pass over the data.
a. Compute the new values based on your calculations.
b. Append these new column values to the main RDD, as shown below:
// inside the map function, for each row:
val newColumns: Seq[Any] = Seq(newCol1, newCol2)
Row.fromSeq(row.toSeq.init ++ newColumns)  // .init drops the last original column
Here row is the reference to the row being processed in the map method.
c. Create a new schema as shown below:
val newColumnsStructType = StructType(Seq(
  StructField("newColName1", IntegerType),
  StructField("newColName2", IntegerType)))
d. Append the new fields to the old schema:
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. Create a new DataFrame with the new columns:
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
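Putting steps a through e together, a minimal end-to-end sketch could look like the following (the age column and the calculations are assumptions for illustration; note that .init drops the last original column, so the two new columns take its place):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// a + b: compute the new values and splice them into each row in one map pass
val newRDD = mainDataFrame.rdd.map { row =>
  val newCol1 = row.getAs[Int]("age") * 2   // hypothetical calculation
  val newCol2 = row.getAs[Int]("age") + 1   // hypothetical calculation
  Row.fromSeq(row.toSeq.init ++ Seq(newCol1, newCol2))
}

// c + d: build the matching schema the same way
val newColumnsStructType = StructType(Seq(
  StructField("newColName1", IntegerType),
  StructField("newColName2", IntegerType)))
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)

// e: assemble the new DataFrame
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)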