Using a Single Hadoop Run to Output Data into Key-Based Buckets

Can I use a single Hadoop job run to output data to different key-based directories?

My use case is server access logs. Say I have them all together, but I want to break them down based on common URL patterns.

For instance,

  • Everything that starts with /foo/ should go to /year/month/day/hour/foo/file
  • Everything that starts with /bar/ should go to /year/month/day/hour/bar/file
  • Anything that doesn't match should go to /year/month/day/hour/other/file

There are two problems here (from my understanding of MapReduce). First, I would rather iterate over my data only once, instead of running one "grep"-style job per type of URL I want to match. But how would I split up the output? If I key the first kind with "foo", the second with "bar", and the rest with "other", don't they all still end up going to the same reducers? How do I tell Hadoop to output them to different files?

The second problem is related (maybe the same?): I also need to break the output up by the timestamp in the access-log line.

I should note that I am not looking for code to solve this problem, but rather for the correct terminology and a high-level approach to look into. If I have to do this with multiple job runs, that's fine, but I can't run one "grep" for every possible hour (to make a file for that hour); there has to be another way?

1 answer

You need to break the data up the way you describe, and then you need to write it to multiple output files. See "Creating multiple output files with Hadoop 0.20+".
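The feature the linked question covers is Hadoop's MultipleOutputs class (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs), which lets a reducer write each record under a baseOutputPath derived from the key, so one job produces a whole directory tree instead of a single undifferentiated output. Below is a minimal sketch, assuming the 0.20+ "new" mapreduce API; the class names and the extractPath helper are hypothetical, and the actual log parsing is left as a placeholder.

```
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class BucketedLogs {

    // Mapper: one pass over the logs. The key encodes the whole target
    // directory (year/month/day/hour/bucket); the value is the raw line.
    public static class BucketMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String dir = extractPath(line.toString());  // e.g. "2011/03/07/14/foo"
            ctx.write(new Text(dir), line);
        }

        private String extractPath(String logLine) {
            // Hypothetical parsing: pull the timestamp and request path out
            // of the log line with whatever format your logs actually use,
            // then map the URL prefix to a bucket name.
            String hourDir = "2011/03/07/14";    // placeholder timestamp
            String url = "/foo/index.html";      // placeholder request path
            String bucket = url.startsWith("/foo/") ? "foo"
                          : url.startsWith("/bar/") ? "bar"
                          : "other";
            return hourDir + "/" + bucket;
        }
    }

    // Reducer: MultipleOutputs.write(key, value, baseOutputPath) places each
    // line under the directory encoded in its key.
    public static class BucketReducer
            extends Reducer<Text, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context ctx) {
            mos = new MultipleOutputs<>(ctx);
        }

        @Override
        protected void reduce(Text dir, Iterable<Text> lines, Context ctx)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                // Writes to <job output dir>/<year>/<month>/<day>/<hour>/<bucket>/part-r-NNNNN
                mos.write(NullWritable.get(), line, dir.toString() + "/part");
            }
        }

        @Override
        protected void cleanup(Context ctx)
                throws IOException, InterruptedException {
            mos.close();
        }
    }
}
```

In the driver you would also typically call LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) so the default, empty part-r-* files are not created alongside the bucketed outputs.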


Source: https://habr.com/ru/post/1779048/

