Hadoop: Specify a directory as input to a MapReduce job

I am using Cloudera Hadoop. I can run a simple MapReduce program where I provide a file as input.

This file contains the names of all the other files that will be processed by the mapper function.

But now I am stuck at one point.

 /folder1
   - file1.txt
   - file2.txt
   - file3.txt

How can I specify the input path to MapReduce as "/folder1" so that it can start processing each file inside this directory?

Any ideas?

EDIT:

1) Initially, I passed inputFile.txt as the input to the MapReduce program. It worked fine.

 >inputFile.txt
 file1.txt
 file2.txt
 file3.txt

2) But now, instead of giving the input file, I want to provide the input directory as args[0] on the command line:

 hadoop jar ABC.jar /folder1 /output 
4 answers

The problem is that FileInputFormat does not read files recursively from the input directory.

Solution: add the following line

 FileInputFormat.setInputDirRecursive(job, true);

before this line in your MapReduce driver code:

 FileInputFormat.addInputPath(job, new Path(args[0]));

You can check here to see in which version this was fixed.
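
For context, here is a minimal driver sketch showing where those two lines go. It assumes the new org.apache.hadoop.mapreduce API; the MyMapper and MyReducer class names are placeholders:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class Driver {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = Job.getInstance(conf, "process folder");

         job.setJarByClass(Driver.class);
         job.setMapperClass(MyMapper.class);    // placeholder mapper class
         job.setReducerClass(MyReducer.class);  // placeholder reducer class

         // descend into subdirectories of the input directory
         FileInputFormat.setInputDirRecursive(job, true);
         // args[0] is the input directory, e.g. /folder1
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));

         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }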


You can use FileSystem.listStatus to get the list of files from the given directory; the code could be as follows:

 //get the FileSystem, you will need to initialize it properly
 FileSystem fs = FileSystem.get(conf);
 //get the FileStatus list from given dir
 FileStatus[] status_list = fs.listStatus(new Path(args[0]));
 if (status_list != null) {
     for (FileStatus status : status_list) {
         //add each file to the list of inputs for the map-reduce job
         FileInputFormat.addInputPath(conf, status.getPath());
     }
 }
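
Note that listStatus only returns the direct children of the directory. If /folder1 itself contains subdirectories, a variation using FileSystem.listFiles walks the tree recursively; this is a sketch reusing the fs and conf objects from the snippet above:

 import org.apache.hadoop.fs.LocatedFileStatus;
 import org.apache.hadoop.fs.RemoteIterator;

 // listFiles(path, true) walks the directory tree recursively
 // and yields only files, never directories
 RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(args[0]), true);
 while (it.hasNext()) {
     LocatedFileStatus status = it.next();
     FileInputFormat.addInputPath(conf, status.getPath());
 }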

You can use HDFS wildcards (glob patterns) to provide multiple files.

So the solution would be:

 hadoop jar ABC.jar /folder1/* /output 

or

 hadoop jar ABC.jar /folder1/*.txt /output 
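
The same glob can also be set from driver code, since FileInputFormat expands glob patterns when it lists its input paths. A sketch, assuming a new-API Job object named job:

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 // the wildcard is expanded when input splits are computed,
 // so every matching file under /folder1 becomes an input
 FileInputFormat.addInputPath(job, new Path("/folder1/*.txt"));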

Use the MultipleInputs class.

 MultipleInputs.addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)

Take a look at the working code.
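
A sketch of how that might look for the three files, assuming the new-API MultipleInputs from org.apache.hadoop.mapreduce.lib.input and a placeholder MyMapper class:

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

 // each path gets its own InputFormat and Mapper if needed
 MultipleInputs.addInputPath(job, new Path("/folder1/file1.txt"), TextInputFormat.class, MyMapper.class);
 MultipleInputs.addInputPath(job, new Path("/folder1/file2.txt"), TextInputFormat.class, MyMapper.class);
 MultipleInputs.addInputPath(job, new Path("/folder1/file3.txt"), TextInputFormat.class, MyMapper.class);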

