Spark: how to create a file path for reading from S3 using Scala

How can I generate and load multiple S3 file paths in Scala so that I can use:

  sqlContext.read.json("s3://..../*/*/*")

I know I can use wildcards to read multiple files, but is there a way to generate the paths? My file structure looks like this: BucketName / year / month / day / files

  s3://testBucket/2016/10/16/part00000 

These files are all JSON. The problem is that I only need to read the files for a given time span. Say the duration is 16 days and the start day is October 16: then I need to read the files from October 1 through October 16.

With a 28-day duration for the same start day, I would need to read back to September 18th.

Can someone suggest a way to do this?

2 answers

As noted in this answer, you can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. For instance:

 sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file") 

Or you can use the AWS API to get a list of file locations and read those files with Spark.

You can see this answer for how to list files in AWS S3.
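
A sketch of that second approach, assuming the AWS SDK for Java (v1) is on the classpath and credentials come from the default provider chain; the bucket name and prefix below are placeholders:

  import com.amazonaws.services.s3.AmazonS3ClientBuilder
  import scala.collection.JavaConverters._

  val s3 = AmazonS3ClientBuilder.defaultClient()

  // listObjects returns at most 1000 keys per call; loop with
  // listNextBatchOfObjects for larger listings.
  val paths = s3.listObjects("testBucket", "2016/10/")
    .getObjectSummaries.asScala
    .map(summary => s"s3://testBucket/${summary.getKey}")

  // Following the comma-separated list suggestion from the other answer.
  val df = sqlContext.read.json(paths.mkString(","))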


You can pass a list of paths separated by commas:

  sqlContext.read.json("s3://testBucket/2016/10/16/,s3://testBucket/2016/10/15/,...")
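
A sketch of generating those paths for a date range, assuming Java 8's java.time API; the bucket name, start day, and duration are placeholders taken from the question:

  import java.time.LocalDate
  import java.time.format.DateTimeFormatter

  val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  // One path per day, counting backwards from the start day.
  def datePaths(bucket: String, start: LocalDate, days: Int): Seq[String] =
    (0 until days).map(i => s"s3://$bucket/${start.minusDays(i).format(fmt)}/")

  // A 16-day duration ending October 16, 2016 covers October 1 through 16.
  val paths = datePaths("testBucket", LocalDate.of(2016, 10, 16), 16)

  val df = sqlContext.read.json(paths.mkString(","))

Note that Spark will fail if any listed path does not exist, so for sparse data you may want to keep only the days that actually contain files, for example by checking with the AWS listing shown in the other answer.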


Source: https://habr.com/ru/post/1011872/

