How to access multiple JSON files as a DataFrame from S3

I am using Apache Spark and I want to read multiple JSON files from S3 by date. How can I select a range of files, i.e. everything from files ending in 1034.json up to files ending in 1434.json? I am trying this:

DataFrame df = sql.read().json("s3://..../..../.....-.....[1034*-1434*]");

But I get the following error:

    at java.util.regex.Pattern.error(Pattern.java:1924)
    at java.util.regex.Pattern.range(Pattern.java:2594)
    at java.util.regex.Pattern.clazz(Pattern.java:2507)
    at java.util.regex.Pattern.sequence(Pattern.java:2030)
    at java.util.regex.Pattern.expr(Pattern.java:1964)
    at java.util.regex.Pattern.compile(Pattern.java:1665)
    at java.util.regex.Pattern.<init>(Pattern.java:1337)
    at java.util.regex.Pattern.compile(Pattern.java:1022)
    at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
    at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)

How do I specify the path correctly?

1 answer

You can read all the JSON files under a prefix like this:

sqlContext.read().json("s3n://bucket/filepath/*.json")

Alternatively, you can use wildcards in the file path to narrow the match. Note that in Hadoop's glob syntax a [...] character class matches exactly one character, so a pattern like [1034*-1434*] is an invalid range; that is why Pattern.compile throws the error above. Express the varying digit with a single-character range instead.

For instance, 1[0-4]34 matches the endings 1034, 1134, 1234, 1334 and 1434:

sqlContext.read().json("s3n://*/*/*-*1[0-4]34.json")
