The cloud map/reduce concept is very similar, but adapted to run in parallel. First, each data object is passed through a function that maps it to a new object (usually some kind of dictionary). Then the reduce function is called on pairs of the objects returned by map, over and over, until only one object remains. That object is the result of the map/reduce operation.
One important consideration is that, because of the parallelization, the reduce function must be able to accept objects produced by the map function as well as objects produced by previous reduce calls. This makes sense when you think about how the parallelization works: many machines each reduce their own data down to one object, and then those objects are reduced together into the final result. Of course, this can happen in several layers if there is a lot of data.
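To make that shape concrete, here is a minimal single-process sketch; the map_reduce helper and its mapper, reducer, and chunks parameters are names introduced here for illustration, not part of any particular framework:

from functools import reduce

def map_reduce(mapper, reducer, chunks):
    # Each "machine" reduces its own chunk of data to one partial result...
    partials = [reduce(reducer, map(mapper, chunk)) for chunk in chunks]
    # ...and the partial results are then reduced into the final answer.
    return reduce(reducer, partials)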
Here is a simple example of how you can use the map/reduce structure to count word occurrences in a list:
words1 = ['a', 'foo', 'bar', 'foobar', 'foo', 'a', 'bar', 'bar', 'bar', 'bar', 'foo']
words2 = ['b', 'foo', 'foo', 'b', 'a', 'bar']
The map function will look like this:
def wordToDict(word):
    return {word: 1}
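So, for example, wordToDict('foo') simply returns {'foo': 1}.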
And the reduce function will look like this:
def countReduce(d1, d2):
    out = d1.copy()
    for key in d2:
        if key in out:
            out[key] += d2[key]
        else:
            out[key] = d2[key]
    return out
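Note that countReduce satisfies the requirement above: it merges two map outputs just as happily as a partial count and another map output. A quick check:

>>> countReduce({'foo': 1}, {'foo': 1})
{'foo': 2}
>>> countReduce({'foo': 2, 'bar': 1}, {'bar': 1})
{'foo': 2, 'bar': 2}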
Then you can run the whole thing like this:
>>> from functools import reduce  # needed in Python 3; reduce is a builtin in Python 2
>>> reduce(countReduce, map(wordToDict, words1 + words2))
{'a': 3, 'foobar': 1, 'b': 2, 'bar': 6, 'foo': 5}
But you can also do it like this (which is what parallelization would do):
>>> reduce(countReduce, [reduce(countReduce, map(wordToDict, words1)),
...                      reduce(countReduce, map(wordToDict, words2))])
{'a': 3, 'foobar': 1, 'b': 2, 'foo': 5, 'bar': 6}
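To actually run the two branches in parallel on one machine, you could use Python's multiprocessing module. This is just a sketch of the idea, assuming wordToDict, countReduce, words1, and words2 are all defined at the top level of the same script; countChunk is a helper name introduced here, and a real cluster framework would distribute the chunks across machines instead:

from functools import reduce
from multiprocessing import Pool

def countChunk(words):
    # One worker maps and reduces its own chunk to a single partial count.
    return reduce(countReduce, map(wordToDict, words))

if __name__ == '__main__':
    with Pool() as pool:
        # Each chunk is handled in its own process, then the partial
        # counts are reduced into the final result.
        partials = pool.map(countChunk, [words1, words2])
        print(reduce(countReduce, partials))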