Hadoop and Python: disable sorting

I noticed that when I run Hadoop with Python code, either the mapper or the reducer (I am not sure which) sorts my output before reducer.py prints it. It is currently sorted alphabetically. Is there a way to disable this completely? I would like the program's output to keep the order in which mapper.py printed it. I found answers for Java, but none for Python. Do I need to modify mapper.py, or perhaps the command-line arguments?

2 answers

You should read more about the basic concepts of MapReduce. Although sorting may not be necessary in some cases, the shuffle part of the Shuffle and Sort phase is an integral part of the MapReduce model. The MapReduce (Hadoop) framework has to group the mappers' output by key so that all values for a given key go to the same reducer, which is what lets the reducer actually "reduce" the data. With Hadoop Streaming, key/value pairs are separated by a tab character by default. From the code example in your other SO questions, I see that you are not emitting key/value tuples, just single lines of text.
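To illustrate the tab convention, here is a minimal sketch of how a streaming record is split into key and value. The helper names split_record and format_record are hypothetical, not part of Hadoop. As far as I know, a line without a tab is treated as a key with an empty value, which is why plain text lines end up sorted as whole lines.

```python
def split_record(line):
    """Split one streaming output line the way the default settings would:
    everything before the first tab is the key, the rest is the value."""
    key, _sep, value = line.rstrip("\n").partition("\t")
    return key, value


def format_record(key, value):
    """Build a line that mapper.py could print as an explicit key/value pair."""
    return f"{key}\t{value}"


if __name__ == "__main__":
    print(split_record("apple\t3\n"))       # explicit key and value
    print(split_record("just some text\n")) # no tab: whole line is the key
    print(format_record("apple", 3))
```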

EDIT: added the following in response to the follow-up question "How do I make it sort numerically (for example, 9 before 10)?"

Alternative 1: Prepend zeros to your keys so that they all have the same length, so that "09" sorts before "10".
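A small sketch of the zero-padding idea, assuming the keys are non-negative integers and that a width of 2 is enough for the largest key. The name pad_key is only illustrative.

```python
def pad_key(number, width=2):
    """Left-pad a non-negative integer key with zeros to a fixed width."""
    return str(number).zfill(width)


keys = [10, 9, 1]

# Text sorting compares character by character, so "10" lands before "9".
plain = sorted(str(k) for k in keys)

# With equal-length keys, text order and numeric order agree.
padded = sorted(pad_key(k) for k in keys)

if __name__ == "__main__":
    print(plain)   # ['1', '10', '9']
    print(padded)  # ['01', '09', '10']
```

In mapper.py the padded key would be printed in place of the raw number; the reducer can strip the zeros again with int(key) if needed.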

Alternative 2: Use KeyFieldBasedComparator, as pointed out in this SO question.
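As a rough sketch of Alternative 2, the generic options below ask Hadoop to compare keys numerically. The property names are the ones I know from recent Hadoop Streaming documentation; older releases used different names (mapred.output.key.comparator.class and mapred.text.key.comparator.options), so check them against your version. The function name numeric_sort_options is hypothetical.

```python
def numeric_sort_options():
    """Return generic -D options that select KeyFieldBasedComparator
    with numeric comparison (-n) of the key."""
    return [
        "-D",
        "mapreduce.job.output.key.comparator.class="
        "org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator",
        "-D",
        "mapreduce.partition.keycomparator.options=-n",
    ]


if __name__ == "__main__":
    # These options go right after the streaming jar on the command line,
    # before -input, -output, -mapper and -reducer.
    print(" ".join(numeric_sort_options()))
```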


No, as indicated here:

If the number of reduce tasks is not 0, the Hadoop framework will sort your results. There is no way around it.
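The flip side of that statement is that a job with zero reduce tasks skips the sort, at the cost of having no reducer at all. Below is a sketch of assembling such a map-only streaming command. The function name and the jar and path arguments are placeholders, and mapreduce.job.reduces is the property name in recent Hadoop versions (older ones used mapred.reduce.tasks).

```python
def map_only_command(streaming_jar, input_path, output_path):
    """Build an argument list for a streaming job with zero reduce tasks.
    With no reducers there is no shuffle and sort; each mapper writes its
    output directly, in the order mapper.py printed it."""
    return [
        "hadoop", "jar", streaming_jar,
        "-D", "mapreduce.job.reduces=0",  # generic options come first
        "-files", "mapper.py",            # ship the script to the cluster
        "-input", input_path,
        "-output", output_path,
        "-mapper", "mapper.py",
    ]


if __name__ == "__main__":
    # All three arguments here are placeholders, not real locations.
    print(" ".join(map_only_command("hadoop-streaming.jar", "in", "out")))
```

Note that the order is preserved only within each mapper's output file; with several input splits there will be several such files.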


Source: https://habr.com/ru/post/907284/

