Combiner function in Python Hadoop streaming

Question

I have a mapper that outputs key-value pairs, which are sorted and piped into reducer.py.

Since the keys are already sorted before they reach the reducer, I want to write a combiner that iterates through the sorted list and outputs key, [v1, v2, v3] pairs for the reducer to use.

cat data | python mapper.py | sort | python reducer.py

What is the best way to write the reducer so that I do not need a dictionary containing all the keys, which takes a lot of memory to hold the entries?

python mapreduce hadoop

Algoman Nov 24 '10 at 16:53

1 answer

katrielalex · Accepted Answer · 2010-11-24T17:03:20+0000

Use itertools.groupby:

>>> import itertools
>>> import operator
>>> foo = [("a", 1), ("a", 2), ("b", 1), ("c", 1), ("c", 2)]
>>> for key, group in itertools.groupby(foo, operator.itemgetter(0)):
...     print(key, list(map(operator.itemgetter(1), group)))
...
a [1, 2]
b [1]
c [1, 2]

Explanation:

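groupby, as the name suggests, groups the elements of an iterable according to a key function. It calls keyfunc on each element in turn and puts consecutive elements with the same keyfunc value into one group; when the value changes, a new group starts. It is lazy, so it consumes the input one element at a time instead of loading it all; the input only has to be sorted by key before it reaches groupby.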
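A small sketch of why the sort step in the pipeline matters: groupby only merges adjacent items with equal keys. The sample data and the helper name `group_values` are made up for illustration.

```python
import itertools
import operator

# Made-up sample data, deliberately not sorted by key.
pairs = [("a", 1), ("b", 1), ("a", 2)]


def group_values(items):
    """Return (key, [values]) for each run of consecutive equal keys."""
    key_of = operator.itemgetter(0)
    return [(key, [value for _, value in group])
            for key, group in itertools.groupby(items, key_of)]


unsorted_groups = group_values(pairs)
sorted_groups = group_values(sorted(pairs))

print(unsorted_groups)  # "a" shows up twice: its pairs are not adjacent
print(sorted_groups)    # each key shows up once
```

With sorted input, every key produces exactly one group, which is the situation after `sort` in the pipeline from the question.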

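operator.itemgetter(0) is a handy "getter" function that maps x to x[0]. Here it picks the key out of each (key, value) pair, and operator.itemgetter(1) picks the value.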

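In practice, you would wrap the input (here, sys.stdin) in a generator that parses each line into a key and a value and hands the pairs over one at a time, using yield.

That way there is no dictionary of all keys: only the values of the key currently being processed are held in memory.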
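A minimal sketch of how this might look for the pipeline in the question. It assumes each line has the form key, tab, value, and that the input is already sorted by key; the function names `parse_pairs` and `combine` are hypothetical, not part of the original answer.

```python
import itertools
import operator


def parse_pairs(lines, separator="\t"):
    """Lazily yield (key, value) from lines shaped like 'key<TAB>value'.

    The tab separator is an assumption about the mapper's output format.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(separator)
        yield key, value


def combine(lines, separator="\t"):
    """Yield (key, [values]) per run of equal keys; input must be sorted by key."""
    pairs = parse_pairs(lines, separator)
    for key, group in itertools.groupby(pairs, operator.itemgetter(0)):
        # Only the values of the current key are held in memory.
        yield key, list(map(operator.itemgetter(1), group))


# Stand-in for sorted mapper output; in reducer.py, pass sys.stdin instead.
sample_lines = ["a\t1\n", "a\t2\n", "b\t1\n", "c\t1\n", "c\t2\n"]

for key, values in combine(sample_lines):
    print(key, values)
```

The values come out as strings because nothing here converts them; a real reducer would parse them as needed before aggregating.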
