Key sorting is best understood once you realize that Hadoop DISTRIBUTES your processing by sending different keys to different machines. The main (simplified) version of the idea is as follows:
The reducer which a (k,v) pair is sent to = k.hashCode()%num_of_machines.
So, if my key's hashCode is 10 and I have 2 machines, that key will be sent, in this case, to machine #0 (since 10 % 2 = 0).
Thus, keys (first of all) give us a simple way to distribute the computation.
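To make that concrete, here is a minimal Java sketch of hash-based key partitioning, modeled on what Hadoop's default HashPartitioner does (the class name and sample keys are illustrative, not taken from a real job):

```java
// Minimal sketch of hash-based key partitioning, in the spirit of Hadoop's
// default HashPartitioner. Class name, keys, and numbers are illustrative only.
public class PartitionSketch {

    // Decide which reducer (machine) a key goes to.
    static int partitionFor(Object key, int numReducers) {
        // Mask off the sign bit so the result is never negative,
        // then take the remainder to pick a reducer in [0, numReducers).
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 2; // the "2 machines" from the example above
        // The same key always lands on the same reducer, no matter which
        // mapper emitted it -- that is the whole point.
        System.out.println(partitionFor("the", numReducers));
        System.out.println(partitionFor("the", numReducers));
        System.out.println(partitionFor("hadoop", numReducers));
    }
}
```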
In addition to simplifying the distribution of computation, keys give us the ability to combine records from disparate data files into a single group. That's how we can do things like word_count.
In fact, if you find that you don't need keys, you probably don't need Hadoop either!
Classic example (word count):
In the Hadoop word-count example, we emit keys (one key = one word) with values (# of times that word was seen in a text segment). This allows a SINGLE reduce function to receive a single word, and thus add up all the times it was seen, producing an accurate word count.
Thus, aggregation by key is what allows the map phase to be distributed across multiple machines independently of one another. Without routing identical keys to the same reducer, in the word-count example we could get several counts for a given word, since there is no guarantee that a single reducer would receive all the counts for that word from all the files.
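For reference, here is a minimal sketch of the classic WordCount mapper and reducer against the standard Hadoop MapReduce API (the job/driver setup is omitted, and the class names are my own):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Mapper: emit (word, 1) for every word in this machine's input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // key = word, value = 1
            }
        }
    }

    // Reducer: all (word, 1) pairs for the same word are routed here,
    // so summing the values gives the exact count for that word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```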
Another example:
Now... let's say we have social security numbers as identifiers, and we want to output an aggregation of personal data. Let's say we have 2 massive files:
ssn -> name
ssn -> shoe_size
In this case, we can use the power of grouping by key, so that an individual's name and shoe size are BOTH sent to the SAME reduce function.
The reducer will get 2 records here:
ssn -> (name, shoe_size)
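As a sketch of what that reduce-side join could look like, assuming each mapper tags its value with the file it came from (the "NAME:" / "SHOE_SIZE:" prefixes are my own convention, not part of the original example):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce side of a simple reduce-side join keyed by ssn.
// Assumes each mapper prefixed its value with the source it came from,
// e.g. "NAME:john" from the names file and "SHOE_SIZE:9" from the other.
public class SsnJoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text ssn, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = null;
        String shoeSize = null;

        // Both records for this ssn arrive in this single reduce call,
        // because they share the same key.
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("NAME:")) {
                name = v.substring("NAME:".length());
            } else if (v.startsWith("SHOE_SIZE:")) {
                shoeSize = v.substring("SHOE_SIZE:".length());
            }
        }

        // Emit the joined record: ssn -> (name, shoe_size)
        context.write(ssn, new Text(name + ", " + shoeSize));
    }
}
```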
The idea here is that when writing map/reduce jobs, you have to encode your output “tuples” so that they can be combined in a meaningful way during the reduce phase. Any distributed computing environment will probably, at some point, need to combine records computed on different nodes. Keys give us a convenient and scalable methodology for doing that.
So, the fact that we are guaranteed that the SAME key always goes to the SAME reducer function ensures that EACH reducer for a particular social security number will receive ALL the data associated with that number, allowing it to combine and output joined data records that include ssn, name, and shoe size.
Conclusion
Without keyed distribution, combining data in this way would require painfully complex logic involving some kind of intermediate data storage/caching. Hadoop simply generalizes and abstracts the common need to “join” data results from parallel computations, using the familiar paradigm of keys and values.