I have been playing with Hadoop and set up a two-node cluster on Ubuntu. The WordCount example works fine.
Now I would like to write my own MapReduce program to analyze some log data (main reason: it looks simple and I have a lot of data).
Each line in the log has this format:
<UUID> <Event> <Timestamp>
where the event can be INIT, START, STOP, ERROR, and a few others. What interests me most is the elapsed time between the START and STOP events for the same UUID.
For example, my log contains entries like these:
35FAA840-1299-11DF-8A39-0800200C9A66 START 1265403584
[...many other lines...]
35FAA840-1299-11DF-8A39-0800200C9A66 STOP 1265403777
My current sequential program reads the files, remembers the START events in memory, and writes the elapsed time to a file once it has seen the corresponding STOP event (lines with other events are currently ignored; an ERROR event invalidates its UUID, so those are ignored as well) 1
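For reference, this is roughly what the sequential version does (a simplified sketch, not my exact code; the class name and file handling are made up):

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the sequential approach (hypothetical names).
public class SequentialElapsed {
    public static void main(String[] args) throws IOException {
        Map<String, Long> startTimes = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\\s+"); // <UUID> <Event> <Timestamp>
                if (parts.length < 3) continue;
                String uuid = parts[0], event = parts[1];
                long ts = Long.parseLong(parts[2]);
                if ("START".equals(event)) {
                    startTimes.put(uuid, ts);          // remember the START
                } else if ("STOP".equals(event)) {
                    Long start = startTimes.remove(uuid);
                    if (start != null) {
                        out.println(uuid + " " + (ts - start)); // elapsed time
                    }
                } else if ("ERROR".equals(event)) {
                    startTimes.remove(uuid);           // ERROR invalidates the UUID
                }
                // INIT and other events are ignored
            }
        }
    }
}
```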
I would like to port this to a Hadoop/MapReduce program, but I am not sure how to do the matching of records. Splitting/tokenizing a file is easy, and I guess that finding the matches will be a Reduce step. But what would that look like? How do you find matching records in a MapReduce job?
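My rough guess is something like the following: a Mapper that keys each record by UUID so that the START and STOP events meet in the same reduce call, and a Reducer that pairs them up. The class names and the use of the org.apache.hadoop.mapreduce API are just my guesses, and the job/driver setup is omitted — is this the right shape?

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Guess: key each record by UUID so START/STOP land in the same reduce call.
public class ElapsedTime {

    public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\\s+"); // <UUID> <Event> <Timestamp>
            if (parts.length < 3) return;
            String event = parts[1];
            if ("START".equals(event) || "STOP".equals(event)) {
                // key = UUID, value = "EVENT TIMESTAMP"
                ctx.write(new Text(parts[0]), new Text(event + " " + parts[2]));
            }
        }
    }

    public static class ElapsedReducer extends Reducer<Text, Text, Text, LongWritable> {
        @Override
        protected void reduce(Text uuid, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long start = -1, stop = -1;
            for (Text v : values) {
                String[] parts = v.toString().split(" ");
                if ("START".equals(parts[0])) start = Long.parseLong(parts[1]);
                else if ("STOP".equals(parts[0])) stop = Long.parseLong(parts[1]);
            }
            // Skip incomplete pairs (see footnote 1: the log may be cut mid-run).
            if (start >= 0 && stop >= 0) {
                ctx.write(uuid, new LongWritable(stop - start));
            }
        }
    }
}
```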
Please keep in mind that my main focus is on understanding Hadoop/MapReduce; pointers to Pig and other Apache projects are welcome, but I would like to solve this problem with pure Hadoop/MapReduce. Thanks.
1) Since the log is taken from a running application, some START events may not yet have their corresponding STOP events, and there will be STOP events without START events due to the splitting of the log file.