Hadoop reducer values in memory?

I am writing a MapReduce job that may end up with a huge number of values in a reducer, and I am worried about all of those values being loaded into memory at once.

Does the underlying implementation of Iterable<VALUEIN> values load values into memory only as needed? Hadoop: The Definitive Guide seems to suggest that it does, but does not give a "definitive" answer.

The output of the reducer will be far larger than the input values, but I believe that output is written to disk as needed.

+6
3 answers

You are reading the book correctly. The reducer does not store all values in memory. Instead, as you iterate over the Iterable of values, each object instance is reused, so only one instance exists at any given time.

For example, in the following code the objs ArrayList will have the expected size after the loop, but every element will be the same object, because the Text val instance is reused on each iteration.

    public static class ReducerExample extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) {
            ArrayList<Text> objs = new ArrayList<Text>();
            for (Text val : values) {
                objs.add(val); // adds the same reused Text instance every time
            }
        }
    }

(If for some reason you want to do further processing on each val, you should make a deep copy and store that instead, as sketched below.)
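A minimal sketch of what that copy could look like, assuming the values are small enough that buffering them is reasonable at all (the class name here is hypothetical):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CopyingReducerExample extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<Text> copies = new ArrayList<Text>();
            for (Text val : values) {
                copies.add(new Text(val)); // deep copy via Text's copy constructor
            }
            // copies now holds distinct objects, each with its own contents
        }
    }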

Of course, even a single value can be larger than memory. In that case, the developer is advised to take steps in the preceding Mapper to break the data down so that no single value is that large (one possible approach is sketched below).
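One possible way to do that, sketched under the assumption that an oversized value can simply be split into smaller chunks upstream (the class name, key choice, and chunk size are illustrative, not from the original answer):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ChunkingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final int CHUNK_SIZE = 1 << 20; // 1M characters per chunk (illustrative)

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            for (int start = 0; start < record.length(); start += CHUNK_SIZE) {
                int end = Math.min(start + CHUNK_SIZE, record.length());
                // Emit each chunk as its own value so the reducer never sees one huge value.
                context.write(new Text("record-key"), new Text(record.substring(start, end)));
            }
        }
    }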

UPDATE: See pages 199-200 of Hadoop: The Definitive Guide, 2nd edition.

    This code snippet makes it clear that the same key and value objects are used on each invocation of the map() method -- only their contents are changed (by the reader's next() method). This can be a surprise to users, who might expect keys and values to be immutable. This causes problems when a reference to a key or value object is retained outside the map() method, as its value can change without warning. If you need to do this, make a copy of the object you want to hold on to. For example, for a Text object, you can use its copy constructor: new Text(value). The situation is similar with reducers. In this case, the value objects in the reducer's iterator are reused, so you need to copy any that you need to retain between calls to the iterator.
+12

It is not entirely in memory; some of it comes from disk. Looking at the code, it appears the framework breaks the Iterable into segments and loads them from disk into memory one at a time. See:

    org.apache.hadoop.mapreduce.task.ReduceContextImpl
    org.apache.hadoop.mapred.BackupStore

+2

As other users have noted, the data is not all loaded into memory. Take a look at some of the mapred-site.xml parameters from the Apache documentation.

 mapreduce.reduce.merge.inmem.threshold 

The default value is 1000. The threshold, in terms of the number of files, for the in-memory merge process.

 mapreduce.reduce.shuffle.merge.percent 

The default value is 0.66. The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent .

 mapreduce.reduce.shuffle.input.buffer.percent 

The default value is 0.70. The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.

 mapreduce.reduce.input.buffer.percent 

The default value is 0. The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is complete, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.

 mapreduce.reduce.shuffle.memory.limit.percent 

The default value is 0.25. The maximum percentage of the in-memory limit that a single shuffle can consume.
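For reference, these properties can also be set per job in the driver rather than in mapred-site.xml. A minimal sketch (the values shown are just the documented defaults, and the class name is made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceMemoryTuning {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Values below are the documented defaults; tune them for your job.
            conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);
            conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
            conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
            conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.0f);
            conf.setFloat("mapreduce.reduce.shuffle.memory.limit.percent", 0.25f);

            Job job = Job.getInstance(conf, "reduce-memory-tuning-example");
            // ... configure mapper, reducer, input and output paths as usual ...
        }
    }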

0

Source: https://habr.com/ru/post/918045/

