Spark: how to create a file path for reading from S3 using Scala

How can I generate and load multiple S3 file paths in Scala so that I can use:

  sqlContext.read.json("s3://..../*/*/*")

I know I can use wildcards to read multiple files, but is there a way to generate the paths? My file structure looks like this: BucketName / year / month / day / files

  s3://testBucket/2016/10/16/part00000 

These files are all JSON. The problem is that I only need to read the files for a given time span. Say the duration is 16 days and the start day is October 16: then I need to read the files from October 1 through October 16.

With a 28-day duration for the same start day, I would need to read back to September 18th.

Can someone suggest a way to do this?

2 answers

As noted in this answer, you can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. For instance:

 sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file") 

Or you can use the AWS API to get a list of file locations and read those files with Spark.

You can see this answer for how to list files in AWS S3.
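
A sketch of that second approach, assuming the AWS SDK for Java (v1) is on the classpath and credentials come from the default provider chain; the bucket name and prefix below are placeholders:

  import com.amazonaws.services.s3.AmazonS3ClientBuilder
  import scala.collection.JavaConverters._

  val s3 = AmazonS3ClientBuilder.defaultClient()

  // listObjects returns at most 1000 keys per call; loop with
  // listNextBatchOfObjects for larger listings.
  val paths = s3.listObjects("testBucket", "2016/10/")
    .getObjectSummaries.asScala
    .map(summary => s"s3://testBucket/${summary.getKey}")

  // Following the comma-separated list suggestion from the other answer.
  val df = sqlContext.read.json(paths.mkString(","))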


You can pass a list of paths separated by commas:

  sqlContext.read.json("s3://testBucket/2016/10/16/,s3://testBucket/2016/10/15/,...")
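
A sketch of generating those paths for a date range, assuming Java 8's java.time API; the bucket name, start day, and duration are placeholders taken from the question:

  import java.time.LocalDate
  import java.time.format.DateTimeFormatter

  val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  // One path per day, counting backwards from the start day.
  def datePaths(bucket: String, start: LocalDate, days: Int): Seq[String] =
    (0 until days).map(i => s"s3://$bucket/${start.minusDays(i).format(fmt)}/")

  // A 16-day duration ending October 16, 2016 covers October 1 through 16.
  val paths = datePaths("testBucket", LocalDate.of(2016, 10, 16), 16)

  val df = sqlContext.read.json(paths.mkString(","))

Note that Spark will fail if any listed path does not exist, so for sparse data you may want to keep only the days that actually contain files, for example by checking with the AWS listing shown in the other answer.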


Source: https://habr.com/ru/post/1011872/

