How to specify only specific fields using read.schema in JSON: Spark Scala

I am trying to programmatically apply a schema to a textFile that contains JSON. I tried jsonFile, but the problem is that when creating a DataFrame from a collection of JSON files, Spark has to make one full pass through the data just to infer the schema. So it parses all of the data, which takes far too long (my job was still stuck after 4 hours, and the data is about a TB in size). Instead, I want to read it as a textFile and enforce a schema, so that I get only the fields of interest for later queries against the resulting DataFrame. But I am not sure how to map the schema onto the input. Can someone give me a pointer on how to map a schema onto JSON input?


This is the complete schema:

records: org.apache.spark.sql.DataFrame = [country: string, countryFeatures: string, customerId: string, homeCountry: string, homeCountryFeatures: string, places: array<struct<freeTrial:boolean,placeId:string,placeRating:bigint>>, siteName: string, siteId: string, siteTypeId: string, Timestamp: bigint, Timezone: string, countryId: string, pageId: string, homeId: string, pageType: string, model: string, requestId: string, sessionId: string, inputs: array<struct<inputName:string,inputType:string,inputId:string,offerType:string,originalRating:bigint,processed:boolean,rating:bigint,score:double,methodId:string>>] 

But I'm only interested in a few fields like:

res45: Array[String] = Array({"requestId":"bnjinmm","siteName":"bueller","pageType":"ad","model":"prepare","inputs":[{"methodId":"436136582","inputType":"US","processed":true,"rating":0,"originalRating":1},{"methodId":"23232322","inputType":"UK","processed":false,"rating":0,"originalRating":1}]})


import org.apache.spark.sql.types._

val records = sc.textFile("s3://testData/sample.json.gz")

val schema = StructType(Array(
    StructField("requestId", StringType, true),
    StructField("siteName", StringType, true),
    StructField("model", StringType, true),
    StructField("pageType", StringType, true),
    StructField("inputs", ArrayType(
        StructType(
            StructField("inputType", StringType, true),
            StructField("originalRating", LongType, true),
            StructField("processed", BooleanType, true),
            StructField("rating", LongType, true),
            StructField("methodId", StringType, true)
        ), true), true)))

val rowRDD = ??

val inputRDD = sqlContext.applySchema(rowRDD, schema)
inputRDD.registerTempTable("input")

sqlContext.sql("select * from input").foreach(println)

Is there any way to map this? Or do I need to use a parser or something else? I want to use textFile only because of the limitations described above.
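For reference, one way to avoid building rowRDD by hand is to pass the text RDD straight to the JSON reader along with the schema; DataFrameReader.json accepts an RDD[String] as of Spark 1.4. A minimal sketch, assuming one JSON object per line:

import org.apache.spark.sql.types._

// Read the gzipped file as plain text, then apply the predefined schema.
// Because the schema is supplied, no inference pass over the data is made.
val raw = sc.textFile("s3://testData/sample.json.gz")
val df = sqlContext.read.schema(schema).json(raw)

df.registerTempTable("input")
sqlContext.sql("select requestId, siteName from input").foreach(println)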

Tried using:

val records = sqlContext.read.schema(schema).json("s3://testData/test2.gz")

But I keep getting this error (it is raised by the schema definition itself, at the inner StructType, not by read.schema):

<console>:37: error: overloaded method value apply with alternatives:
     (fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
      (fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
      (fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
     cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
           StructField("inputs",ArrayType(StructType(StructField("inputType",StringType,true), StructField("originalRating",LongType,true), StructField("processed",BooleanType,true), StructField("rating",LongType,true), StructField("score",DoubleType,true), StructField("methodId",StringType,true)),true),true)))
                                              ^
1 answer

It can be loaded using the following code with a predefined schema; Spark does not need to make a pass through the gzipped file to infer it. The code in the question does not compile: StructType.apply is overloaded to accept an Array, a java.util.List, or a Seq of StructFields, but the inner StructType is handed the fields as separate arguments, which is exactly what the error message reports.

import org.apache.spark.sql.types._

val input = StructType(
                Array(
                    StructField("inputType",StringType,true), 
                    StructField("originalRating",LongType,true), 
                    StructField("processed",BooleanType,true), 
                    StructField("rating",LongType,true), 
                    StructField("score",DoubleType,true), 
                    StructField("methodId",StringType,true)
                )
            )

val schema = StructType(Array(
    StructField("requestId", StringType, true),
    StructField("siteName", StringType, true),
    StructField("model", StringType, true),
    StructField("inputs", ArrayType(input, true), true)
))

val records = sqlContext.read.schema(schema).json("s3://testData/test2.gz")

Not all fields have to be provided; only the fields listed in the schema are read. That said, it is good to provide the full schema when possible.
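For example, a quick usage sketch (assuming the records DataFrame above; in Spark SQL, inputs.inputType over an array of structs yields an array<string>, one entry per element of inputs):

// Register and query only the projected fields; anything not listed in
// the schema is simply absent from the DataFrame.
records.registerTempTable("input")
sqlContext.sql("SELECT requestId, siteName, inputs.inputType FROM input").show()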

Spark does its best to parse everything; if some row is invalid JSON, it adds a _corrupt_record column that contains the entire raw row. This applies when the input is a line-delimited JSON file (one object per line).
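A sketch of how to surface those rows when supplying your own schema: the corrupt-record column must be added to the schema explicitly (the name _corrupt_record is the default and is configurable via spark.sql.columnNameOfCorruptRecord):

// Keep malformed rows alongside the parsed fields for inspection.
val schemaWithCorrupt =
    StructType(schema.fields :+ StructField("_corrupt_record", StringType, true))

val parsed = sqlContext.read.schema(schemaWithCorrupt).json("s3://testData/test2.gz")
parsed.filter("_corrupt_record is not null").show()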


Source: https://habr.com/ru/post/1605816/

