Hadoop: Specify a directory as input to a MapReduce job

I am using Cloudera Hadoop. I can run a simple MapReduce program where I provide a file as input.

This file contains the names of all the other files that will be processed by the mapper function.

But now I am stuck at one point.

 /folder1
   - file1.txt
   - file2.txt
   - file3.txt

How can I specify the input path to MapReduce as "/folder1" so that it can start processing each file inside this directory?

Any ideas?

EDIT:

1) Initially, I passed inputFile.txt as the input to the MapReduce program. It worked fine.

 >inputFile.txt
 file1.txt
 file2.txt
 file3.txt

2) But now, instead of giving the input file, I want to provide the input directory as args[0] on the command line:

 hadoop jar ABC.jar /folder1 /output 
4 answers

The problem is that FileInputFormat does not read files recursively from the input directory.

Solution: add the following line

 FileInputFormat.setInputDirRecursive(job, true);

before this line in your MapReduce driver code:

 FileInputFormat.addInputPath(job, new Path(args[0]));

You can check here to see in which version this was fixed.
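
For context, here is a minimal driver sketch showing where those two lines go. It assumes the new org.apache.hadoop.mapreduce API; the MyMapper and MyReducer class names are placeholders:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class Driver {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = Job.getInstance(conf, "process folder");

         job.setJarByClass(Driver.class);
         job.setMapperClass(MyMapper.class);    // placeholder mapper class
         job.setReducerClass(MyReducer.class);  // placeholder reducer class

         // descend into subdirectories of the input directory
         FileInputFormat.setInputDirRecursive(job, true);
         // args[0] is the input directory, e.g. /folder1
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));

         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }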


You can use FileSystem.listStatus to get the list of files from the given directory; the code could be as follows:

 //get the FileSystem, you will need to initialize it properly
 FileSystem fs = FileSystem.get(conf);
 //get the FileStatus list from given dir
 FileStatus[] status_list = fs.listStatus(new Path(args[0]));
 if (status_list != null) {
     for (FileStatus status : status_list) {
         //add each file to the list of inputs for the map-reduce job
         FileInputFormat.addInputPath(conf, status.getPath());
     }
 }
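
Note that listStatus only returns the direct children of the directory. If /folder1 itself contains subdirectories, a variation using FileSystem.listFiles walks the tree recursively; this is a sketch reusing the fs and conf objects from the snippet above:

 import org.apache.hadoop.fs.LocatedFileStatus;
 import org.apache.hadoop.fs.RemoteIterator;

 // listFiles(path, true) walks the directory tree recursively
 // and yields only files, never directories
 RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(args[0]), true);
 while (it.hasNext()) {
     LocatedFileStatus status = it.next();
     FileInputFormat.addInputPath(conf, status.getPath());
 }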

You can use HDFS wildcards (glob patterns) to provide multiple files.

So the solution would be:

 hadoop jar ABC.jar /folder1/* /output 

or

 hadoop jar ABC.jar /folder1/*.txt /output 
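
The same glob can also be set from driver code, since FileInputFormat expands glob patterns when it lists its input paths. A sketch, assuming a new-API Job object named job:

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 // the wildcard is expanded when input splits are computed,
 // so every matching file under /folder1 becomes an input
 FileInputFormat.addInputPath(job, new Path("/folder1/*.txt"));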

Use the MultipleInputs class.

 MultipleInputs.addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)

Take a look at the working code.
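
A sketch of how that might look for the three files, assuming the new-API MultipleInputs from org.apache.hadoop.mapreduce.lib.input and a placeholder MyMapper class:

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

 // each path gets its own InputFormat and Mapper if needed
 MultipleInputs.addInputPath(job, new Path("/folder1/file1.txt"), TextInputFormat.class, MyMapper.class);
 MultipleInputs.addInputPath(job, new Path("/folder1/file2.txt"), TextInputFormat.class, MyMapper.class);
 MultipleInputs.addInputPath(job, new Path("/folder1/file3.txt"), TextInputFormat.class, MyMapper.class);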

