How to access multiple JSON files as a DataFrame from S3

I am using Apache Spark and I want to read multiple JSON files from S3 by date. How can I select a range of files, i.e. everything from files ending in 1034.json up to files ending in 1434.json? I am trying this:

DataFrame df = sql.read().json("s3://..../..../.....-.....[1034*-1434*]");

But I get the following error:

    at java.util.regex.Pattern.error(Pattern.java:1924)
    at java.util.regex.Pattern.range(Pattern.java:2594)
    at java.util.regex.Pattern.clazz(Pattern.java:2507)
    at java.util.regex.Pattern.sequence(Pattern.java:2030)
    at java.util.regex.Pattern.expr(Pattern.java:1964)
    at java.util.regex.Pattern.compile(Pattern.java:1665)
    at java.util.regex.Pattern.<init>(Pattern.java:1337)
    at java.util.regex.Pattern.compile(Pattern.java:1022)
    at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
    at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)

How do I specify the path correctly?

1 answer

You can read all the JSON files under a prefix like this:

sqlContext.read().json("s3n://bucket/filepath/*.json")

Alternatively, you can use wildcards in the file path to narrow the match. Note that in Hadoop's glob syntax a [...] character class matches exactly one character, so a pattern like [1034*-1434*] is an invalid range; that is why Pattern.compile throws the error above. Express the varying digit with a single-character range instead.

For instance, 1[0-4]34 matches the endings 1034, 1134, 1234, 1334 and 1434:

sqlContext.read().json("s3n://*/*/*-*1[0-4]34.json")
