Is the Wikipedia Explanation of MapReduce's Reduce Phase Incorrect?

The MongoDB explanation of the reduce phase says:

The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.

This is how I always understood reduce to work in a general map-reduce environment: you can sum values across N machines by reducing the values on each machine and then sending those outputs on to yet another reducer.
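
To make that concrete, here is a minimal Python sketch of that mental model (toy data, no real framework involved): each machine reduces its own values, and a second-level reducer then reduces the partial outputs. Note that this only works if the reduce function's output can be fed back in as its input.

    # Toy sketch of "reduce on each machine, then reduce the reducers' outputs".
    # reduce_fn must accept its own previous output as input for this to be valid.
    def reduce_fn(key, values):
        return sum(values)

    machine_a = [("clicks", 1), ("clicks", 1), ("clicks", 1)]
    machine_b = [("clicks", 1), ("clicks", 1)]

    partial_a = reduce_fn("clicks", [v for _, v in machine_a])  # reduced on machine A -> 3
    partial_b = reduce_fn("clicks", [v for _, v in machine_b])  # reduced on machine B -> 2
    final = reduce_fn("clicks", [partial_a, partial_b])         # second-level reducer -> 5
    assert final == 5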

Wikipedia says:

The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and produce zero or more outputs.

Here you would need to move all of the values with the same key to the same machine to be summed. Moving the data to the function seems to be the opposite of what map-reduce is supposed to do.
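
For contrast, here is a toy Python sketch of the contract Wikipedia describes (again, no real framework): the shuffle brings every value for a key to one place in sorted order, and the reduce function is then called exactly once per key with an iterator over those values.

    from itertools import groupby
    from operator import itemgetter

    # All pairs for a key end up together after the sort/shuffle.
    shuffled = sorted([("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1)], key=itemgetter(0))

    def reduce_fn(key, values_iter):
        return sum(values_iter)  # called once per key, sees every value for that key

    for key, group in groupby(shuffled, key=itemgetter(0)):
        print(key, reduce_fn(key, (v for _, v in group)))
    # cat 3
    # dog 1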

Is the Wikipedia description too specific? Or did MongoDB break map-reduce? (Or am I missing something here?)

+4
3 answers

This is how the original MapReduce framework is described by Google:

2 Programming Model

[...]

The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

And later:

3 Implementation

[...]

6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function.

So there is only one call to Reduce. The problem of moving large numbers of small intermediate pairs over the network is addressed by an optional Combiner function that runs locally:

4.3 Combiner Function

In some cases, there is significant repetition in the intermediate keys produced by each map task [...] We allow the user to specify an optional Combiner function that does partial merging of this data before it is sent over the network.

The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions. [...]

Partial combining significantly speeds up certain classes of MapReduce operations.
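
A toy Python illustration of the combiner idea (word count with made-up data): the combiner does a partial merge on the map node, so far fewer pairs have to cross the network, and the reducer later merges the already-combined partial counts.

    from collections import Counter

    # Map output on one worker, before anything is sent over the network.
    map_output = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1), ("the", 1)]

    # Combiner: partial merge on the map node (the same logic a word-count reducer uses).
    combined = Counter()
    for word, count in map_output:
        combined[word] += count

    print(dict(combined))  # {'the': 3, 'cat': 1, 'sat': 1} -- 3 pairs shipped instead of 5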

TL;DR

Wikipedia follows the original MapReduce design; the MongoDB designers have taken a slightly different approach.

+3

According to the Google MapReduce paper:

When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.

The MongoDB documentation says:

The map/reduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent.

So, in the case of MapReduce as defined in the Google paper, the reduce function starts processing the key/value pairs for a given key once all of the data for that key has been transferred to the reducer. But, as Tomas said, MongoDB appears to invoke the reduce function several times, including on its own intermediate results.
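
Here is a toy Python simulation (not MongoDB's actual engine; names and data are illustrative) of why that re-invocation forces the idempotency requirement: a later reduce call may receive the output of an earlier call mixed with freshly mapped values, so the reduce output must have the same shape as a mapped value, and merging must give the same result either way.

    # Toy simulation of a re-invoked reduce; not MongoDB's engine.
    def reduce_fn(key, values):
        out = {"count": 0, "total": 0}
        for v in values:  # v may be a mapped value or the result of a prior reduce call
            out["count"] += v["count"]
            out["total"] += v["total"]
        return out        # same shape as the mapped values, so re-reducing is safe

    mapped = [{"count": 1, "total": 10}, {"count": 1, "total": 5}, {"count": 1, "total": 7}]
    once = reduce_fn("A", mapped)
    twice = reduce_fn("A", [reduce_fn("A", mapped[:2]), mapped[2]])
    assert once == twice == {"count": 3, "total": 22}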

In MapReduce as proposed by Google, either the Map or the Reduce tasks will be processing KV pairs at any one time, but in the MongoDB implementation the Map and Reduce tasks appear to process KV pairs simultaneously. The MongoDB approach can be inefficient because the nodes are not used efficiently, and there is a chance that the Map and Reduce slots in the cluster fill up and no new jobs can be started.

The catch in Hadoop is that although the reducer tasks don't process KV pairs until the mappers have finished processing the data, the reducer tasks can be spawned before the mappers finish. The parameter is "mapreduce.job.reduce.slowstart.completedmaps", it is set to "0.05", and its description reads "Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job."

Here you would need to move all of the values with the same key to the same machine to be summed. Moving the data to the function seems to be the opposite of what map-reduce is supposed to do.

Also, data locality is considered only for map tasks, not for reduce tasks. For reduce tasks, the data has to be moved from the mappers on different nodes to the reducers for aggregation.

Just my 2c.

+1

TL;DR: reduce (mongo) is more like a combiner, and finalize (mongo) is almost like a reducer, except that it only gets a single key/value. If you need all of your data in one function the way you would in a hadoop reducer, have reduce (mongo) merge it into one big array and pass that on to finalize; use some kind of flag in the output values to do this (see the sketch at the end of this answer).

This is how I do it, and I think it sucks for large data loads, but I don't know any other way to do it with mongodb mapreduce :( (but I am not very experienced with it).
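
A rough Python sketch of that workaround (MongoDB itself takes JavaScript functions; map_fn/reduce_fn/finalize_fn and the sample documents here are purely illustrative): reduce only concatenates values into one array, and finalize, which runs once per key on the reduced value, does the real work on the full array.

    # Toy sketch of "reduce merges into a big array, finalize does the real reduction".
    def map_fn(doc):
        return doc["key"], [doc["value"]]  # wrap each value in a list

    def reduce_fn(key, values):
        merged = []
        for v in values:                   # lists merge into lists, so re-reducing is safe
            merged.extend(v)
        return merged

    def finalize_fn(key, merged):
        return sum(merged)                 # the "real" reduce: sees all values at once

    docs = [{"key": "A", "value": 2}, {"key": "A", "value": 5}, {"key": "A", "value": 1}]
    emitted = [map_fn(d)[1] for d in docs]
    print(finalize_fn("A", reduce_fn("A", emitted)))  # 8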

0

Source: https://habr.com/ru/post/1440559/
