Multiple files as input on Amazon Elastic MapReduce

I am trying to get started with Amazon Elastic MapReduce (EMR) using a custom jar. I want to process about 1000 files in a single directory. When I submit my job with s3n://bucketname/compressed/*.xml.gz as input, I get a "matched 0 files" error. If I pass the absolute path to a single file (for example, s3n://bucketname/compressed/00001.xml.gz), it runs fine, but only that one file gets processed. I also tried passing the directory name (s3n://bucketname/compressed/), hoping that the files inside would be processed, but that just hands the directory itself to the job.

Meanwhile, on a small local Hadoop installation, submitting the job with the same kind of wildcard (/path/to/dir/on/hdfs/*.xml.gz) works fine, and all 1000 files are listed correctly.

How do I get EMR to pick up all of my files?

1 answer

I don't know how EMR expands input globs, but here is a snippet of code that works for me:

    // Imports needed (Hadoop MapReduce API); assumes a configured Job named "job":
    // import java.net.URI;
    // import org.apache.hadoop.fs.FileStatus;
    // import org.apache.hadoop.fs.FileSystem;
    // import org.apache.hadoop.fs.Path;
    // import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    FileSystem fs = FileSystem.get(URI.create(args[0]), job.getConfiguration());
    FileStatus[] files = fs.listStatus(new Path(args[0]));
    for (FileStatus sfs : files) {
        // Add every entry in the input directory as an input path for the job.
        FileInputFormat.addInputPath(job, sfs.getPath());
    }

This lists all the files in the input directory and adds each one as an input path, so you can then do whatever you need with them.
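One caveat: listStatus returns every entry in the directory, including any subdirectories or stray files, so you may want to filter the listing before adding input paths. (Hadoop's FileSystem also has a globStatus(Path) method that expands a glob pattern directly, which may be an alternative here.) Below is a minimal, self-contained sketch of the filtering idea using only the Java standard library; the class name GlobFilter is hypothetical, and plain strings stand in for Hadoop FileStatus entries:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class GlobFilter {
    // Keep only the file names that match the given glob, e.g. "*.xml.gz".
    static List<String> matching(List<String> names, String glob) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> out = new ArrayList<>();
        for (String n : names) {
            // Match against the file name component only, not the full path.
            if (m.matches(Paths.get(n).getFileName())) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> listing = List.of("00001.xml.gz", "00002.xml.gz", "README.txt");
        System.out.println(matching(listing, "*.xml.gz"));
        // prints [00001.xml.gz, 00002.xml.gz]
    }
}
```

In the real job you would apply the same kind of check to each FileStatus path inside the loop before calling FileInputFormat.addInputPath.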


Source: https://habr.com/ru/post/893176/
