Reading a JSON file using Apache Spark

I am trying to read a JSON file using Spark v2.0.0. With simple data the code works very well. With slightly more complex data, when I call df.show(), the data is not displayed correctly.

Here is my code:

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.show();

Here is my data:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

And my output looks like this:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|       "glossary": {|
|        "title": ...|
|           "GlossDiv": {|
|            "titl...|
|               "GlossList": {|
|                "...|
|                 ...|
|                   "SortAs": "S...|
|                   "GlossTerm":...|
|                   "Acronym": "...|
|                   "Abbrev": "I...|
|                   "GlossDef": {|
|                 ...|
|                       "GlossSeeAl...|
|                 ...|
|                   "GlossSee": ...|
|                   }|
|                   }|
|                   }|
+--------------------+
only showing top 20 rows
+4
4 answers

To read this JSON you need to put it on a single line. The file is multi-line JSON, so it is not parsed and loaded correctly: the reader expects one object per line, and each object becomes one row.

Quoting the JSON API documentation:

Loads a JSON file (one object per line) and returns the result as a DataFrame.

{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}
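If the file cannot be reformatted by hand, the collapse onto one line can be scripted. The following is a minimal sketch in plain Java, not part of Spark: the class name JsonCompactor and the method compact are hypothetical names chosen for illustration. It assumes the input is already valid JSON and only strips whitespace that sits outside string literals; it does not validate anything.

```java
final class JsonCompactor {
    // Removes whitespace that appears outside string literals, so a
    // pretty-printed JSON document collapses onto a single line.
    // Assumes the input is valid JSON; no validation is performed.
    static String compact(String json) {
        StringBuilder out = new StringBuilder(json.length());
        boolean inString = false; // currently inside a "..." literal
        boolean escaped = false;  // previous char was a backslash inside a string
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c); // keep everything inside strings, spaces included
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
                out.append(c);
            } else if (!Character.isWhitespace(c)) {
                out.append(c); // structural characters, numbers, literals
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pretty = "{\n    \"glossary\": {\n        \"title\": \"example glossary\"\n    }\n}";
        System.out.println(compact(pretty));
    }
}
```

The compacted string can then be written back to a file, one object per line, and read with session.read().json(...) as in the question.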

I tried this in the shell (with the single-line JSON):

scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]

scala>

You can then read nested values out of the DataFrame, for example:

scala> df.select(df("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
|           GlossTerm|
+--------------------+
|Standard Generali...|
+--------------------+


scala>

+5

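Make sure your JSON is on a single line. Once it loads, keep in mind that this is nested JSON, so you select nested fields by their path. For example, to show the title of GlossDiv: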

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.select("glossary.GlossDiv.title").show();
+3

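Try reading each file as a single record (wholeTextFiles returns (path, content) pairs, so keep only the values):

JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
Dataset<Row> df = session.read().json(jsc.wholeTextFiles("...").values());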
0

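This is the same approach as the one from @user6022341. Explanation:

wholeTextFiles(String path) reads each file as a single record, so a multi-line JSON file is read whole. For example, if hdfs://a-hdfs-path contains the files part-00000 and part-00001, then sparkContext.wholeTextFiles("hdfs://a-hdfs-path") returns a JavaPairRDD in which the key is the path of each file and the value is the content of that file.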

json json-, , , , hasoop.Configuration, . .

If you need to read a multi-line CSV file, you can do it as of Spark 2.2. The JSON reader accepts the same multiLine option from that version on. In PySpark:

spark.read.csv(file, multiLine=True)

https://issues.apache.org/jira/browse/SPARK-19610

https://issues.apache.org/jira/browse/SPARK-20980

Hope this helps other people looking for similar information.

0

Source: https://habr.com/ru/post/1658654/

