Can I use a single Hadoop job to output data to different key-based directories?
My use case is server access logs. Say I have them all together, but I want to break them down based on common URL patterns.
For instance,
- Everything that starts with /foo/ should go to /year/month/day/hour/foo/file
- Everything that starts with /bar/ should go to /year/month/day/hour/bar/file
- Anything that doesn't match should go to /year/month/day/hour/other/file
There are two problems here (from my understanding of MapReduce): first, I would rather iterate over my data just once than run one "grep"-style job per URL type I want to match. But how would I split up the output? If I key the first kind with "foo", the second with "bar", and the rest with "other", won't they all still end up going to the same reducers? How do I tell Hadoop to output them to different files?
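From the reading I've done so far, my guess is that this is what MultipleOutputs (in the new mapreduce API) is for: the third argument to write() is a base output path relative to the job's output directory, and it can apparently contain slashes. A minimal sketch of how I imagine the reducer side, assuming the mapper already emits a directory-like key such as 2012/06/01/13/foo (the class and variable names are just my placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch only: assumes the mapper emits (directory-path, log-line) pairs,
// e.g. key "2012/06/01/13/foo" with the raw access log line as the value.
public class BucketReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text dir, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        for (Text line : lines) {
            // The third argument is a base path under the job's output
            // directory; Hadoop appends the usual -r-00000 suffix, so this
            // should land in .../2012/06/01/13/foo/file-r-00000.
            out.write(NullWritable.get(), line, dir.toString() + "/file");
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close();
    }
}
```

The older mapred API seems to offer MultipleTextOutputFormat with a generateFileNameForKeyValue() hook for the same purpose, and LazyOutputFormat apparently suppresses the empty default part-r-* files, but I may be misreading the docs.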
The second problem is related (maybe the same?): I also need to break the output down by the timestamp in each access log line.
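For the timestamp part, my guess is that the mapper builds the whole year/month/day/hour/bucket path as the key. A sketch under the assumption that the logs are in Apache common log format, with the timestamp in [01/Jun/2012:13:55:36 +0000] brackets (the parsing details are invented for illustration):

```java
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: emits (year/month/day/hour/bucket, raw-log-line) pairs.
public class LogBucketMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Parses just the hour-resolution prefix of "01/Jun/2012:13:55:36".
    private final SimpleDateFormat logFormat =
            new SimpleDateFormat("dd/MMM/yyyy:HH", Locale.ENGLISH);
    private final SimpleDateFormat dirFormat =
            new SimpleDateFormat("yyyy/MM/dd/HH");

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();

        // The timestamp sits between '[' and ']' in common log format.
        int open = line.indexOf('[');
        int close = line.indexOf(']', open);
        if (open < 0 || close < 0) {
            return; // skip malformed lines
        }
        String hourPath;
        try {
            hourPath = dirFormat.format(
                    logFormat.parse(line.substring(open + 1, close)));
        } catch (ParseException e) {
            return; // skip lines with unparseable timestamps
        }

        // The request path is the second token of the quoted request,
        // e.g. "GET /foo/index.html HTTP/1.1".
        String bucket = "other";
        int quote = line.indexOf('"');
        if (quote >= 0) {
            String[] request = line.substring(quote + 1).split(" ");
            if (request.length > 1) {
                if (request[1].startsWith("/foo/")) {
                    bucket = "foo";
                } else if (request[1].startsWith("/bar/")) {
                    bucket = "bar";
                }
            }
        }

        // The key doubles as the output directory; the value is the line.
        context.write(new Text(hourPath + "/" + bucket), value);
    }
}
```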
I should note that I am not so much looking for finished code as for the correct terminology and a high-level approach to study. If I need to do this in multiple steps, that's fine, but I can't run one "grep" for every possible hour just to produce that hour's file; there must be another way?