What is the relationship between Python's map/reduce and cloud-computing MapReduce?

I'm new to Python.

Can anyone explain how Python's map() and reduce() functions (and their counterparts in other functional languages) relate to the MapReduce concept used in distributed computing?

2 answers

The cloud map/reduce concept is very similar, but it has been adapted to run in parallel. First, each data object is passed through a function that maps it to a new object (often some kind of dictionary). Then the reduce function is called on pairs of the objects returned by map until only one object remains. That object is the result of the map/reduce operation.

One important consequence of the parallelization is that the reduce function must be able to accept both objects produced by the map function and objects produced by previous calls to reduce. This makes sense when you think about how the parallelization proceeds: many machines each reduce their portion of the data to a single object, and then those objects are reduced to the final result. Of course, this can happen in several layers if there is a lot of data.

Here is a simple example of how you can use the map/reduce pattern to count the words in a list:

# renamed from list and list2 so the built-in list type is not shadowed
words1 = ['a', 'foo', 'bar', 'foobar', 'foo', 'a', 'bar', 'bar', 'bar', 'bar', 'foo']
words2 = ['b', 'foo', 'foo', 'b', 'a', 'bar']

The map function will look like this:

def wordToDict(word):
    # Map each word to a dictionary holding a count of one for that word.
    return {word: 1}

And the reduce function will look like this:

def countReduce(d1, d2):
    # Merge two count dictionaries by summing the counts per word.
    out = d1.copy()
    for key in d2:
        if key in out:
            out[key] += d2[key]
        else:
            out[key] = d2[key]
    return out

Then you can map/reduce like this:

>>> from functools import reduce  # in Python 3, reduce lives in functools
>>> reduce(countReduce, map(wordToDict, words1 + words2))
{'a': 3, 'foo': 5, 'bar': 6, 'foobar': 1, 'b': 2}
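As an aside, modern Python can produce the same counts without a hand-written reduce function; collections.Counter merges counts for you (a standard-library shortcut, not part of the original map/reduce illustration):

from collections import Counter

print(Counter(words1 + words2))
# Counter({'bar': 6, 'foo': 5, 'a': 3, 'b': 2, 'foobar': 1})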

Going back to countReduce, you can also evaluate it like this (which is what the parallelization effectively does):

>>> reduce(countReduce, [reduce(countReduce, map(wordToDict, words1)),
...                      reduce(countReduce, map(wordToDict, words2))])
{'a': 3, 'foo': 5, 'bar': 6, 'foobar': 1, 'b': 2}
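If you want the two branches to actually run in parallel, here is a minimal sketch using the standard library's multiprocessing module; the helper count_chunk is mine, added for illustration, and the sketch assumes wordToDict, countReduce, words1 and words2 are defined at module level as above:

from functools import reduce
from multiprocessing import Pool

def count_chunk(words):
    # Each worker maps its chunk of words and reduces it to one dict.
    return reduce(countReduce, map(wordToDict, words))

if __name__ == '__main__':
    with Pool(2) as pool:
        # Compute one partial count per chunk in separate processes...
        partials = pool.map(count_chunk, [words1, words2])
    # ...then reduce the partial counts into the final result.
    print(reduce(countReduce, partials))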

In fact, these concepts are quite different, and the shared names are misleading.

In functional programming (where Python borrowed these functions):

  • map applies a function to every element of a list and returns a new list of the results
  • reduce uses a function to aggregate all the values in a list into a single value (both are demonstrated in the snippet just below)
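For reference, a minimal demonstration of both functions (note that in Python 3, reduce must be imported from functools):

from functools import reduce

print(list(map(lambda x: x * 2, [1, 2, 3])))  # [2, 4, 6]
print(reduce(lambda a, b: a + b, [1, 2, 3]))  # 6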

In MapReduce Distributed Computing:

  • we always work with key-value pairs (well, pairs anyway)
  • the mapper takes a list of pairs and produces another list of pairs (the input key loses its semantics in this context)
  • the reducer receives a key and the list of values corresponding to that key (taken from the mapper's output) and produces a list of key-value pairs; the one place where the key really behaves like a key is between the mapper output and the reducer input, where values are grouped by key before being passed to the reducer (see the sketch after this list)
  • you can also have a partitioner and a combiner :)
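To make the pair-oriented model concrete, here is a minimal single-process sketch of those stages; the names mapper, shuffle and reducer are illustrative only and do not correspond to any particular framework's API:

from collections import defaultdict

def mapper(key, value):
    # The input key (here, a line number) loses its semantics;
    # emit one (word, 1) pair per word in the line.
    for word in value.split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key -- the step between map and reduce where
    # the key really behaves like a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Receive one key with all of its values and emit output pairs.
    yield (key, sum(values))

lines = {0: 'a foo bar', 1: 'foo foo b'}
mapped = [pair for k, v in lines.items() for pair in mapper(k, v)]
reduced = [pair for key, values in shuffle(mapped)
                for pair in reducer(key, values)]
print(dict(reduced))  # {'a': 1, 'foo': 3, 'bar': 1, 'b': 1}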

Note that the mapper does not always create one output pair for each input pair, and the reducer does not always reduce each (key, list of values) group to one output pair. The mapper and the reducer can output whatever they want. For example, a mapper can be used to filter pairs: it then produces an output pair for some inputs and ignores the others. It is also common for a mapper or reducer to emit more than one pair per input (or to do so only for some inputs).
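In the generator style of the sketch above, a filtering mapper and a multi-emitting mapper could look like this (both are hypothetical illustrations):

def filtering_mapper(key, value):
    # Emits an output pair only for some inputs: short words are dropped.
    for word in value.split():
        if len(word) > 3:
            yield (word, 1)

def expanding_mapper(key, value):
    # Emits more than one output pair per input word.
    for word in value.split():
        yield (word, 1)
        yield ('total_letters', len(word))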

But in most cases a MapReduce job really does work much like reduce(reduce_function, map(map_function, list)): the mapper usually performs some calculation on each input, and the reducer usually aggregates a list of values in some way. Any map_function and reduce_function pair can be expressed in MapReduce, but not vice versa.
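A minimal sketch of that embedding: emit every mapped value under one shared dummy key, so the whole list arrives at a single reducer call (embed_as_mapreduce is an illustrative name, not a real API):

from functools import reduce

def embed_as_mapreduce(map_function, reduce_function, items):
    # Mapper stage: emit each mapped value under the same dummy key.
    pairs = [(None, map_function(x)) for x in items]
    # Shuffle stage: with a single key there is one group of values.
    values = [value for _, value in pairs]
    # Reducer stage: aggregate the whole group with reduce_function.
    return reduce(reduce_function, values)

print(embed_as_mapreduce(lambda x: x * x, lambda a, b: a + b, [1, 2, 3]))
# 14  (= 1 + 4 + 9)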


Source: https://habr.com/ru/post/901414/

