Multiple files as input on Amazon Elastic MapReduce

I am trying to get started with Amazon Elastic MapReduce (EMR) using a custom jar. I want to process about 1000 files in a single directory. When I submit my job with s3n://bucketname/compressed/*.xml.gz as input, I get a "matched 0 files" error. If I pass the absolute path to a single file (for example, s3n://bucketname/compressed/00001.xml.gz), it runs fine, but only that one file gets processed. I also tried passing the directory name (s3n://bucketname/compressed/), hoping that the files inside would be processed, but that just hands the directory itself to the job.

Meanwhile, on a small local Hadoop installation, submitting the job with the same kind of wildcard (/path/to/dir/on/hdfs/*.xml.gz) works fine, and all 1000 files are listed correctly.

How do I get EMR to pick up all of my files?

1 answer

I don't know how EMR expands input globs, but here is a snippet of code that works for me:

    // Imports needed (Hadoop MapReduce API); assumes a configured Job named "job":
    // import java.net.URI;
    // import org.apache.hadoop.fs.FileStatus;
    // import org.apache.hadoop.fs.FileSystem;
    // import org.apache.hadoop.fs.Path;
    // import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    FileSystem fs = FileSystem.get(URI.create(args[0]), job.getConfiguration());
    FileStatus[] files = fs.listStatus(new Path(args[0]));
    for (FileStatus sfs : files) {
        // Add every entry in the input directory as an input path for the job.
        FileInputFormat.addInputPath(job, sfs.getPath());
    }

This lists all the files in the input directory and adds each one as an input path, so you can then do whatever you need with them.
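One caveat: listStatus returns every entry in the directory, including any subdirectories or stray files, so you may want to filter the listing before adding input paths. (Hadoop's FileSystem also has a globStatus(Path) method that expands a glob pattern directly, which may be an alternative here.) Below is a minimal, self-contained sketch of the filtering idea using only the Java standard library; the class name GlobFilter is hypothetical, and plain strings stand in for Hadoop FileStatus entries:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class GlobFilter {
    // Keep only the file names that match the given glob, e.g. "*.xml.gz".
    static List<String> matching(List<String> names, String glob) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> out = new ArrayList<>();
        for (String n : names) {
            // Match against the file name component only, not the full path.
            if (m.matches(Paths.get(n).getFileName())) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> listing = List.of("00001.xml.gz", "00002.xml.gz", "README.txt");
        System.out.println(matching(listing, "*.xml.gz"));
        // prints [00001.xml.gz, 00002.xml.gz]
    }
}
```

In the real job you would apply the same kind of check to each FileStatus path inside the loop before calling FileInputFormat.addInputPath.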


Source: https://habr.com/ru/post/893176/
