It is true that one of the bottlenecks in Hadoop MapReduce is the network bandwidth between the machines in the cluster. However, the output of each map task is not sent to every machine in the cluster.
The number of map and reduce tasks is determined by the job you are running. Each mapper processes its input split, sorts its output so that equal keys are grouped together, and writes it to local disk. The job configuration determines how many reduce tasks should be applied to the mapper outputs.
Each reducer must see all the data for a given key. So if a job runs with a single reducer, the output of every mapper has to be sent to the one node in the cluster that runs that reducer. Before the reduce phase starts, the map outputs are merged so that all values for each key are grouped together.
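A minimal sketch of this flow in plain Python (not the actual Hadoop API) using word count as the example: map output is sorted so equal keys are adjacent, which stands in for the merge step, and a single reduce function then sees every value for each key.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Emit (word, 1) for every word, like a word-count mapper.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum all counts for one key.
    return (key, sum(values))

lines = ["hadoop map reduce", "map reduce map"]

# Collect all map output, then sort so equal keys are adjacent
# (this stands in for the merge/shuffle step).
pairs = sorted(p for line in lines for p in map_fn(line))

# With a single reducer, every key's complete group reaches one place.
result = dict(reduce_fn(k, (v for _, v in grp))
              for k, grp in groupby(pairs, key=itemgetter(0)))
# result == {"hadoop": 1, "map": 3, "reduce": 2}
```

The sort-then-group step is why each mapper sorts its output before writing to disk: pre-sorted runs can be merged cheaply on the reducer side.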
When multiple reducers are used, the mappers partition their output, producing one partition per reducer. Each partition is sent to the correct reducer. This guarantees that all data for a given key is processed by a single reducer.
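Partitioning is typically done by hashing the key modulo the number of reducers; Hadoop's default HashPartitioner works this way. A rough Python stand-in (using Python's `hash()` in place of Java's `hashCode()`):

```python
def partition(key, num_reducers):
    # Mirrors the shape of Hadoop's default HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
    # Masking keeps the value non-negative before the modulo.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

num_reducers = 3
keys = ["apple", "banana", "apple", "cherry", "banana"]
assignments = {k: partition(k, num_reducers) for k in keys}

# The same key always maps to the same partition, so every value
# for that key ends up at the same reducer.
assert assignments["apple"] == partition("apple", num_reducers)
```

Because the function is deterministic, duplicate keys emitted by different mappers still land in the same partition, which is the property the reduce phase relies on.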
To cut down the amount of data that has to be sent over the network, you can apply a combiner function to the map output. This shrinks the output of each mapper, minimizing the data that must be transferred to the reducers and speeding up the overall job.
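A combiner acts like a mini-reducer run locally on one mapper's output before the shuffle. A small sketch (again plain Python, not the Hadoop API) for the word-count case:

```python
from collections import Counter

# Raw (word, 1) pairs produced by a single mapper.
map_output = [("map", 1), ("reduce", 1), ("map", 1), ("map", 1)]

# The combiner sums counts for each key locally, before anything
# crosses the network.
counts = Counter()
for word, n in map_output:
    counts[word] += n
combined = sorted(counts.items())
# combined == [("map", 3), ("reduce", 1)]

# Four records shrink to two, so less data travels to the reducers.
assert len(combined) < len(map_output)
```

Note that a combiner is only safe when the reduce operation is associative and commutative (as summation is), since Hadoop may run it zero, one, or many times.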