I have been playing with Hadoop and set up a two-node cluster on Ubuntu. The WordCount example works fine.
Now I would like to write my own MapReduce program to analyze some log data (main reason: it looks simple and I have a lot of data).
Each line in the log has this format:
<UUID> <Event> <Timestamp>
where the event can be INIT, START, STOP, ERROR, and a few others. What interests me most is the elapsed time between the START and STOP events for the same UUID.
For example, my log contains entries like these:
35FAA840-1299-11DF-8A39-0800200C9A66 START 1265403584
[...many other lines...]
35FAA840-1299-11DF-8A39-0800200C9A66 STOP 1265403777
My current sequential program reads the files, remembers the START events in memory, and writes the elapsed time to a file once it has seen the corresponding STOP event (lines with other events are currently ignored; an ERROR event invalidates its UUID, so those are ignored as well) 1
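For reference, this is roughly what the sequential version does (a simplified sketch, not my exact code; the class name and file handling are made up):

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the sequential approach (hypothetical names).
public class SequentialElapsed {
    public static void main(String[] args) throws IOException {
        Map<String, Long> startTimes = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\\s+"); // <UUID> <Event> <Timestamp>
                if (parts.length < 3) continue;
                String uuid = parts[0], event = parts[1];
                long ts = Long.parseLong(parts[2]);
                if ("START".equals(event)) {
                    startTimes.put(uuid, ts);          // remember the START
                } else if ("STOP".equals(event)) {
                    Long start = startTimes.remove(uuid);
                    if (start != null) {
                        out.println(uuid + " " + (ts - start)); // elapsed time
                    }
                } else if ("ERROR".equals(event)) {
                    startTimes.remove(uuid);           // ERROR invalidates the UUID
                }
                // INIT and other events are ignored
            }
        }
    }
}
```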
I would like to port this to a Hadoop/MapReduce program, but I am not sure how to do the matching of records. Splitting/tokenizing a file is easy, and I guess that finding the matches will be a Reduce step. But what would that look like? How do you find matching records in a MapReduce job?
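My rough guess is something like the following: a Mapper that keys each record by UUID so that the START and STOP events meet in the same reduce call, and a Reducer that pairs them up. The class names and the use of the org.apache.hadoop.mapreduce API are just my guesses, and the job/driver setup is omitted — is this the right shape?

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Guess: key each record by UUID so START/STOP land in the same reduce call.
public class ElapsedTime {

    public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\\s+"); // <UUID> <Event> <Timestamp>
            if (parts.length < 3) return;
            String event = parts[1];
            if ("START".equals(event) || "STOP".equals(event)) {
                // key = UUID, value = "EVENT TIMESTAMP"
                ctx.write(new Text(parts[0]), new Text(event + " " + parts[2]));
            }
        }
    }

    public static class ElapsedReducer extends Reducer<Text, Text, Text, LongWritable> {
        @Override
        protected void reduce(Text uuid, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long start = -1, stop = -1;
            for (Text v : values) {
                String[] parts = v.toString().split(" ");
                if ("START".equals(parts[0])) start = Long.parseLong(parts[1]);
                else if ("STOP".equals(parts[0])) stop = Long.parseLong(parts[1]);
            }
            // Skip incomplete pairs (see footnote 1: the log may be cut mid-run).
            if (start >= 0 && stop >= 0) {
                ctx.write(uuid, new LongWritable(stop - start));
            }
        }
    }
}
```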
Please keep in mind that my main focus is on understanding Hadoop/MapReduce; pointers to Pig and other Apache projects are welcome, but I would like to solve this problem with pure Hadoop/MapReduce. Thanks.
1) Since the log is taken from a running application, some START events may not yet have their corresponding STOP events, and there will be STOP events without START events due to the splitting of the log file.