How to pass an argument to the main program in Hadoop

Every time I run my Hadoop program, I need to change the number of mappers and reducers. Is there a way to pass the number of mappers and reducers to my program from the command line (when I launch it) and then read it from args?

+4
2 answers

It is important to understand that you cannot truly force the number of map tasks. Ultimately, the number of map tasks is determined by the number of input splits, which depends on your InputFormat implementation. Say you have 1 TB of input and an HDFS block size of 64 MB: Hadoop will compute roughly 16,000 map tasks, and from there, if you manually specify a value lower than 16,000 it will be ignored, while a value higher than 16,000 can take effect.
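To see where that estimate comes from, a back-of-the-envelope calculation, assuming one input split per HDFS block:

 1 TB / 64 MB = 1,048,576 MB / 64 MB = 16,384 input splits, i.e. roughly 16,000 map tasks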

To pass things in through the command line, the easiest way is to use the built-in GenericOptionsParser class (described here), which directly parses the common Hadoop-related command-line arguments, like what you are trying to do. The nice thing is that it lets you pass almost any Hadoop parameter you want without having to write extra code for it later. You would do something like this:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.util.GenericOptionsParser;

 public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     String[] extraArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
     // do something with your non-Hadoop parameters if needed
 }

Now, the properties you need to set to change the number of mappers and reducers are mapred.map.tasks and mapred.reduce.tasks, so you can simply launch your job with those options:

 -D mapred.map.tasks=42 -D mapred.reduce.tasks=42

and they will be parsed directly by your GenericOptionsParser and will automatically populate your Configuration object. Note the space between -D and the property: it matters, because without it the option is interpreted as a JVM system property rather than a Hadoop configuration property.
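Putting it all together, a hypothetical invocation might look like this (the jar name, driver class, and paths are made-up placeholders):

 hadoop jar myjob.jar MyJobDriver -D mapred.map.tasks=42 -D mapred.reduce.tasks=42 /user/me/input /user/me/output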

Here is a good link if you want to know more about it.

+7

You can specify the number of mappers and reducers (and indeed any parameter you can set in the configuration) using the -D option. This works for all the default Hadoop jars as well as your own jars, as long as your driver class extends Configured.

 hadoop jar myJar.jar -Dmapreduce.job.maps=<Number of maps> -Dmapreduce.job.reduces=<Number of reducers> 
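For reference, here is a minimal sketch of such a driver (the class name MyJobDriver and the job wiring are illustrative assumptions, not from the answer above). Extending Configured and implementing Tool lets ToolRunner run GenericOptionsParser for you, so the -D options land in getConf() before run() is called:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class MyJobDriver extends Configured implements Tool {

     @Override
     public int run(String[] args) throws Exception {
         // getConf() already holds anything passed with -D on the command line
         Job job = Job.getInstance(getConf(), "my job");
         job.setJarByClass(MyJobDriver.class);
         // set mapper, reducer and output key/value classes here...
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         return job.waitForCompletion(true) ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         // ToolRunner applies GenericOptionsParser before calling run(),
         // so run() only sees the non-Hadoop leftovers
         System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
     }
 }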

From there you can retrieve the values with:

 configuration.get("mapreduce.job.maps");
 configuration.get("mapreduce.job.reduces");

or, for reducers:

 job.getNumReduceTasks(); 
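Note that configuration.get() returns the value as a String; Configuration also offers typed getters if you need a number (the default of 1 here is an arbitrary assumption):

 int reduces = configuration.getInt("mapreduce.job.reduces", 1);

And if you would rather hard-wire the reducer count in the driver instead of passing it on the command line, the Job API also exposes a setter (42 is an arbitrary example value):

 job.setNumReduceTasks(42); // equivalent to setting mapreduce.job.reduces for this job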

Specifying the number of maps via configuration values will not work if mapreduce.jobtracker.address is "local". See Charles's answer for an explanation of how Hadoop typically determines the number of mappers from the size of the input data.

+1

Source: https://habr.com/ru/post/1483052/

