What is the Mapper or Reducer setup() method used for?

What exactly are the setup and cleanup methods used for? I tried to find out what they do, but could not find a precise description anywhere. For example, how does the setup method use data from the input split? Does it receive the split as a whole, or line by line?

3 answers

As already mentioned, setup() and cleanup() are methods that you can override if you choose, and they exist so that you can initialize and clean up your map/reduce tasks. You do not have direct access to any data from the input split during these phases. The lifecycle of a map/reduce task, from the programmer's point of view, is:

setup → map → cleanup

setup → reduce → cleanup
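The order of calls within a single task can be sketched in plain Java. This is only a stand-in loosely modeled on how a task drives its mapper; `LifecycleSketch`, `SketchMapper` and `runTask` are hypothetical names for illustration and are not Hadoop classes. It also shows that setup() receives no input data, while map() sees the records one at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the task lifecycle; these are NOT Hadoop classes.
class LifecycleSketch {

    // Hypothetical, simplified mapper with the same three hooks.
    static class SketchMapper {
        final List<String> events = new ArrayList<>();

        void setup() { events.add("setup"); }                      // no input data here
        void map(String record) { events.add("map(" + record + ")"); }
        void cleanup() { events.add("cleanup"); }

        // setup once, map once per record of the split, cleanup once.
        void run(List<String> split) {
            setup();
            try {
                for (String record : split) {
                    map(record);
                }
            } finally {
                cleanup();
            }
        }
    }

    // Runs one simulated task over the given records and returns the call log.
    static List<String> runTask(List<String> split) {
        SketchMapper mapper = new SketchMapper();
        mapper.run(split);
        return mapper.events;
    }

    public static void main(String[] args) {
        System.out.println(runTask(Arrays.asList("line 1", "line 2", "line 3")));
    }
}
```

Running `main` prints `[setup, map(line 1), map(line 2), map(line 3), cleanup]`.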

What usually happens during setup() is that you can read the parameters from the configuration object to configure the processing logic.

What usually happens during cleanup() is that you release any resources you have allocated. Another use is to flush any aggregate results you have accumulated during the task.
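The "flush accumulated results" use of cleanup() might look roughly like this. It is a minimal plain-Java sketch, not Hadoop code: `CleanupFlushSketch`, `CountingMapper` and the `output` list are hypothetical, and a real mapper would call context.write(...) where this sketch appends to a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in; a real mapper would write to its Context instead of a list.
class CleanupFlushSketch {

    static class CountingMapper {
        private Map<String, Integer> counts;
        final List<String> output = new ArrayList<>();

        // Allocate the per-task accumulator once.
        void setup() { counts = new LinkedHashMap<>(); }

        // Accumulate counts in memory instead of emitting one record per token.
        void map(String line) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        // Flush the accumulated totals exactly once, at the end of the task.
        void cleanup() {
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                output.add(entry.getKey() + "\t" + entry.getValue());
            }
        }
    }

    // Drives one simulated task: setup, map per line, cleanup.
    static List<String> runTask(List<String> lines) {
        CountingMapper mapper = new CountingMapper();
        mapper.setup();
        for (String line : lines) {
            mapper.map(line);
        }
        mapper.cleanup();
        return mapper.output;
    }

    public static void main(String[] args) {
        System.out.println(runTask(Arrays.asList("a b a", "b a")));
    }
}
```

Nothing is emitted until cleanup() runs, so the task produces one record per distinct word rather than one per occurrence, at the cost of holding the counts in memory.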

The setup() and cleanup() methods are just "hooks" for you, the developer/programmer, to be able to do something before and after your map/reduce tasks.

For example, in the canonical word count example, suppose you want to exclude certain words from the count (stop words such as "the", "a", "be", etc.). When you configure your MapReduce job, you can pass a comma-delimited list of these words as a parameter (a key-value pair) in the configuration object. Then, in your map code, setup() can read the stop words and store them in a field that is shared by all map() calls within that map task, and your map logic can skip those words. The following is a modified version of the example at http://wiki.apache.org/hadoop/WordCount .

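    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            private Set<String> stopWords;

            // Called once per map task, before any call to map().
            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                Configuration conf = context.getConfiguration();
                stopWords = new HashSet<String>();
                // Default to "" so a missing property does not cause a NullPointerException.
                for (String stopWord : conf.get("stop.words", "").split(",")) {
                    // Trim, because the list below is written as "the, a, an, ...".
                    stopWords.add(stopWord.trim());
                }
            }

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    String token = tokenizer.nextToken();
                    if (stopWords.contains(token)) {
                        continue;
                    }
                    // Emit the token just read; calling nextToken() again would skip a word.
                    word.set(token);
                    context.write(word, one);
                }
            }
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The stop-word list travels to every task through the job configuration.
            conf.set("stop.words", "the, a, an, be, but, can");

            Job job = new Job(conf, "wordcount");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }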
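One detail is worth checking separately: the stop.words value is written with a space after each comma. The standalone sketch below, which does not use Hadoop, compares how such a string splits with and without trimming; `StopWordParsing`, `parseNaive` and `parseTrimmed` are hypothetical helper names for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Standalone check of how a comma-delimited stop-word list is parsed.
class StopWordParsing {

    // Splits on commas only, so entries keep any surrounding spaces.
    static Set<String> parseNaive(String csv) {
        return new HashSet<>(Arrays.asList(csv.split(",")));
    }

    // Splits on commas, then trims each entry and drops empty ones.
    static Set<String> parseTrimmed(String csv) {
        Set<String> words = new HashSet<>();
        for (String raw : csv.split(",")) {
            String trimmed = raw.trim();
            if (!trimmed.isEmpty()) {
                words.add(trimmed);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String csv = "the, a, an, be, but, can";
        System.out.println(parseNaive(csv).contains("a"));   // false: the stored entry is " a"
        System.out.println(parseTrimmed(csv).contains("a")); // true
    }
}
```

Without trimming, only the first word in the list would ever match a token produced by StringTokenizer, since tokens never contain leading spaces.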
 setup: Called once at the beginning of the task. 

Here you can do custom initialization.

 cleanup: Called once at the end of the task. 

Here you can release resources.


setup() and cleanup() are each called once per task.
For example, if you have 5 mappers running and want to initialize some values for each of them, you can do that in setup(). Your setup() method is then called 5 times, once in each mapper.
So, for each map or reduce task, setup() is called first, then map()/reduce() is called for each input record (or key), and then cleanup() is called before the task exits.


Source: https://habr.com/ru/post/974171/

