Hadoop: reducing over multiple input formats

I have two files with different data formats in HDFS. How should the job be configured if I need to reduce over both kinds of files?

E.g. imagine the word-count problem, where one file uses a space as the word separator and the other an underscore. In my approach, I would need a different mapper for each file format, feeding a common reducer.

How can this be done? Or is there a better solution than mine?

+3
1 answer

Check out the MultipleInputs class, which solves exactly this problem. It's pretty neat: for each input path you pass in an InputFormat and, optionally, a Mapper class.
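A minimal job-configuration sketch of this approach (a fragment, not a standalone program: it assumes Hadoop on the classpath, plus hypothetical `SpaceMapper`, `UnderscoreMapper`, and `SumReducer` classes and illustrative HDFS paths):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Job job = Job.getInstance(conf, "multi-format word count");
job.setJarByClass(WordCountDriver.class);

// One mapper per input format; both emit (word, 1) pairs,
// so a single reducer can sum counts across both datasets.
MultipleInputs.addInputPath(job, new Path("/data/space-separated"),
        TextInputFormat.class, SpaceMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/underscore-separated"),
        TextInputFormat.class, UnderscoreMapper.class);

job.setReducerClass(SumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileOutputFormat.setOutputPath(job, new Path("/out/wordcount"));
```

The two mappers differ only in the separator they split on; everything after the map phase is shared.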

If you are googling for code examples, search for "reduce-side join", where this technique is commonly used.


On the other hand, sometimes it's easier to just use a hack. For example, if one set of files uses a space separator and the other an underscore, load both with the same mapper and TextInputFormat, and try both possible separators. Count the tokens from the two splits and, in the word-count example, pick the split that produces more tokens.
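The core of that single-mapper hack can be sketched as a plain Java method (no Hadoop dependencies, so the separator-guessing logic is easy to see; the class and method names are my own):

```java
import java.util.Arrays;

public class SeparatorGuess {
    // Split the line on both candidate separators and keep whichever
    // split yields more tokens: that is the separator this line
    // actually uses. A mapper would call this per input line.
    static String[] splitOnLikelySeparator(String line) {
        String[] bySpace = line.split(" ");
        String[] byUnderscore = line.split("_");
        return bySpace.length >= byUnderscore.length ? bySpace : byUnderscore;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnLikelySeparator("hello world foo")));
        System.out.println(Arrays.toString(splitOnLikelySeparator("hello_world_foo")));
    }
}
```

The mapper then emits each returned token with a count of 1, exactly as in an ordinary word count.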

This also works if both file sets use the same separator but a different number of columns: tokenize on the comma and count the tokens. If there are 5 tokens, the record is from dataset A; if there are 7, it is from dataset B.
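The column-count variant, again as a plain Java sketch (the 5-column/7-column split and the class name are illustrative, matching the numbers in the paragraph above):

```java
public class DatasetClassifier {
    // Classify a comma-separated record by its column count:
    // 5 columns -> dataset "A", 7 columns -> dataset "B".
    // The -1 limit keeps trailing empty fields, so "a,b,,," still
    // counts as 5 columns.
    static String classify(String line) {
        int columns = line.split(",", -1).length;
        if (columns == 5) return "A";
        if (columns == 7) return "B";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(classify("a,b,c,d,e"));     // 5 columns: dataset A
        System.out.println(classify("1,2,3,4,5,6,7")); // 7 columns: dataset B
    }
}
```

A mapper would call this per record and branch into the dataset-specific parsing logic accordingly.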

+4

Source: https://habr.com/ru/post/944548/
