Where is sorting used in the MapReduce phase and why?

I am new to Hadoop. It is not clear to me why we need to sort by key in Hadoop MapReduce. After the map phase, the data corresponding to each unique key needs to be distributed to some number of reducers. Can't this be done without sorting?

+6
2 answers

It can, because sorting is a neat trick to group your keys. Of course, if your job or algorithm does not require any order among your keys, then you can group faster with a hashing trick.
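To illustrate the difference, here is a minimal plain-Java sketch (not Hadoop's actual shuffle code) of the two grouping strategies: sorting makes equal keys adjacent, so a single sequential pass can hand each group to a reduce call, while a hash map groups them with no ordering at all.

```java
import java.util.*;

public class GroupingSketch {
    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>(List.of(
                Map.entry("cat", 1), Map.entry("dog", 1), Map.entry("cat", 1)));

        // 1) Sort-based grouping: after sorting by key, equal keys are adjacent,
        //    so one sequential pass can hand each contiguous run of values to a "reduce" call.
        pairs.sort(Map.Entry.comparingByKey());
        for (int i = 0; i < pairs.size(); ) {
            String key = pairs.get(i).getKey();
            List<Integer> run = new ArrayList<>();
            while (i < pairs.size() && pairs.get(i).getKey().equals(key)) {
                run.add(pairs.get(i++).getValue());
            }
            System.out.println("reduce(" + key + ", " + run + ")");
        }

        // 2) Hash-based grouping: no ordering needed, equal keys land in the same bucket.
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        System.out.println(groups); // e.g. {cat=[1, 1], dog=[1]} (iteration order may vary)
    }
}
```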

Hadoop itself already has a JIRA filed for this (source). Several other distributions that build on top of Hadoop already have these features, for example Hanborq (they call it sort avoidance). (source)

As for your actual question (the why): MapReduce was essentially described in a paper from Google (source), which states the following:

We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or when users of the output find it convenient to have the data sorted.

So it was more a convenience decision to support sorting, not an inherent requirement that sorting be the only way to group keys.

+14

Sorting by key is best understood by noting that Hadoop DISTRIBUTES processing for you by sending different keys to different machines. The basic (simplified) version of the idea is as follows:

The reducer which a (k,v) pair is sent to = k.hashCode()%num_of_machines. 

So, if my key's hashcode is 10 and I have 2 machines, the key will be sent to machine #0 (10 % 2 = 0).

Thus, keys (first of all) give us a simple way to distribute computation.
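To make the hashCode() % num_of_machines idea concrete, here is a rough sketch of a custom Partitioner in Hadoop's mapreduce API. It mirrors what the built-in HashPartitioner already does by default; you would only write one like this to change how keys are routed to reducers.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Roughly what Hadoop's default HashPartitioner does: pick the reducer
// (the "machine" above) from the key's hash code.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so a negative hashCode() cannot yield a negative partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```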

Beyond simplifying the distribution of computation, keys give us a way to join records from disparate data files together. This is how, for example, we can do things like word_count.

In fact, if you find that you don't need keys, you probably don't need hadoop either!

Classic example (word count):

In the Hadoop word-count example, we emit keys (one key = one word) with values (the # of times that word was seen in a segment of text). This allows a SINGLE reduce function to receive a single word and thus add up all the times it was seen, producing an accurate word count.

Thus, aggregation by key is what allows the map phase to be distributed across many machines independently of one another. Without funneling all values for the same key to the same reducer, in the word-count example we could get several partial counts for a given word, since there is no guarantee that a single reducer would receive all the counts for that word from all the files.
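For reference, here is a condensed sketch of the familiar WordCount mapper and reducer using Hadoop's mapreduce API (the driver/job setup is omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: because all pairs for the same word reach the same reducer,
    // summing the values here yields the exact count for that word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```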

Another example:

Now... let's say we have social security numbers as identifiers, and we want to output an aggregation of personal data. Let's say we have 2 massive files:

ssn -> name

ssn -> shoe_size

In this case, we can use the power of grouping by key, so that an individual's name and shoe size are BOTH sent to the SAME reduce function.

The reducer will thus get 2 records here:

ssn -> name, shoe_size
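Here is a minimal sketch of what the reduce side of such a join might look like. The NAME:/SHOE: value tags are made-up conventions for illustration, and the two mappers (not shown) would each emit the ssn as the key with a value tagged by the file it came from.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: both mappers emit the ssn as the key, tagging each value
// with its origin, e.g. ("123-45-6789", "NAME:jay") from the name file and
// ("123-45-6789", "SHOE:10") from the shoe-size file.
public class SsnJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text ssn, Iterable<Text> taggedValues, Context context)
            throws IOException, InterruptedException {
        String name = "";
        String shoeSize = "";
        // Every record for this ssn is guaranteed to arrive here, so we can join them.
        for (Text value : taggedValues) {
            String v = value.toString();
            if (v.startsWith("NAME:")) {
                name = v.substring("NAME:".length());
            } else if (v.startsWith("SHOE:")) {
                shoeSize = v.substring("SHOE:".length());
            }
        }
        context.write(ssn, new Text(name + "," + shoeSize));
    }
}
```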

The idea here is that when writing map/reduce jobs, you have to encode the “tuples” you emit in such a way that they can be joined together in a meaningful way during the reduce phase. Any distributed computing environment will probably, at some point, need to combine records computed on different nodes. Keys give us a convenient and scalable methodology for doing this.

So, the guarantee that the SAME keys go to the SAME reducer function ensures that EACH reducer handling a particular social security number will receive ALL of the data associated with that number, which allows us to join and output data records that include the ssn, name, and shoe size.

Conclusion

Without distribution by key, joining data in this way would require painfully complex logic involving some kind of intermediate data store / cache. Hadoop simply generalizes and abstracts the common need to “join” data results from parallel computations by using the familiar paradigm of keys and values.

+1

Source: https://habr.com/ru/post/921789/
