Large file merge algorithm

I have several event log files (one event per line). The logs may overlap. The logs are created on separate client machines, potentially in several different time zones (but I assume I know each file's time zone). Each event has a timestamp that has been normalized to UTC (by instantiating each log parser's Calendar instance with the time zone corresponding to that log file, and then using getTimeInMillis to get the UTC time in milliseconds). The logs are already sorted by timestamp. Several events can occur at the same timestamp, but they are by no means identical.
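(As an illustration of that normalization, here is a minimal sketch; the zone ID and date fields below are made up, and in practice the zone would come from whichever client machine produced the log file.)

    import java.util.Calendar;
    import java.util.TimeZone;

    public class TimestampNormalizer {
        // Converts a local wall-clock time from a known zone into UTC epoch milliseconds.
        static long toUtcMillis(int year, int month, int day,
                                int hour, int minute, int second, String zoneId) {
            Calendar cal = Calendar.getInstance(TimeZone.getTimeZone(zoneId));
            cal.clear();                               // discard the "now" fields
            cal.set(year, month, day, hour, minute, second);
            return cal.getTimeInMillis();              // epoch millis, i.e. UTC-based
        }

        public static void main(String[] args) {
            // e.g. 2011-06-01 09:30:00 local time in a hypothetical client's zone
            System.out.println(toUtcMillis(2011, Calendar.JUNE, 1, 9, 30, 0, "America/New_York"));
        }
    }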

These files can be relatively large, for example 500,000 events or more in a single log, so reading the entire contents of the logs into a simple Event[] is not feasible.

What I'm trying to do is merge the events from all the logs into a single log. The task is somewhat like a mergesort: each log is already sorted, I just need to interleave them. The second wrinkle is that the same event may be witnessed in multiple individual log files, and I want to de-duplicate repeated events in the output log.

Can this be done "in place", as in, working sequentially over some small buffer of each log file? I can't simply read all the files into an Event[], sort the list, and remove duplicates, but so far my limited programming abilities only let me see that as the solution. Is there a more sophisticated approach I can use here, reading events from each of the logs at the same time?

+4
6 answers
  • Read the first line from each log file

  • Loop

    a. Find the "earliest" line.

    b. Insert the “earliest” line into the master log file

    c. Read the next line from the file that contained the earliest line

You can check for duplicates between steps b and c by advancing the pointer in every file whose current line is the same as the line just written.
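Roughly, in Java (a sketch only: the timestamp extraction is passed in as a function, since the question's Event format isn't given; duplicates are dropped here by remembering the last line written, which catches copies that end up adjacent in the merged output):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.List;
    import java.util.function.ToLongFunction;

    public class LinearMerge {
        // Merges already-sorted log readers into 'out', skipping adjacent duplicates.
        public static void merge(List<BufferedReader> logs, PrintWriter out,
                                 ToLongFunction<String> timestampOf) throws IOException {
            String[] current = new String[logs.size()];
            for (int i = 0; i < logs.size(); i++) {
                current[i] = logs.get(i).readLine();                // first line of each file
            }
            String lastWritten = null;
            while (true) {
                int earliest = -1;                                  // a. find the "earliest" line
                for (int i = 0; i < current.length; i++) {
                    if (current[i] != null && (earliest < 0
                            || timestampOf.applyAsLong(current[i]) < timestampOf.applyAsLong(current[earliest]))) {
                        earliest = i;
                    }
                }
                if (earliest < 0) break;                            // every file is exhausted
                if (!current[earliest].equals(lastWritten)) {       // skip duplicate events
                    out.println(current[earliest]);                 // b. write to the master log
                    lastWritten = current[earliest];
                }
                current[earliest] = logs.get(earliest).readLine();  // c. advance that file
            }
        }
    }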

+10

Sure - open each log file. Read the first line of each into an array of “current” lines. Then repeatedly select the line with the lowest timestamp from the current array. Write it to the output and read a new line from the corresponding source file to replace it.

Here is an example in Python, but it also makes good pseudocode:

    def merge_files(files, key_func):
        # Populate the current array with the first line from each file
        current = [file.readline() for file in files]
        while len(current) > 0:
            # Find and yield the line with the lowest key according to key_func
            min_idx = min(range(len(files)), key=lambda x: key_func(current[x]))
            yield current[min_idx]
            new_line = files[min_idx].readline()
            if not new_line:
                # EOF: remove this file from consideration
                del current[min_idx]
                del files[min_idx]
            else:
                current[min_idx] = new_line
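Note that this generator only does the merge; suppressing duplicate events, as asked in the question, is left to the caller, for example by skipping any yielded line that is identical to the line written just before it.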
+4

Check out this link: http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194

  • Use a heap (array-based). The number of elements in this heap/array will equal the number of log files you have.

  • Read the first record from every file and insert each into your heap.

  • Loop until there are no more records in any of the files:

  > remove the min (earliest) element from the heap
  > write it to the output
  > read the next record from the file to which the (previous) min element belonged
      if there are no more records in that file
          remove that file from the file list
          continue
  > if it is not the same as the (previous) min element, add it to the heap

Now you have all your events in a single log file, sorted, with no duplicates. The time complexity of the algorithm is O(n log k), where n is the total number of entries and k is the number of log files.

Use buffered readers and buffered writers for the file I/O to minimize the number of disk reads and writes and keep the run time down.
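For example, a rough Java sketch along these lines, using a PriorityQueue as the heap and buffered readers/writer (the timestamp extraction is again passed in, since the line format isn't specified; duplicates are dropped by comparing against the last line written):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.function.ToLongFunction;

    public class HeapMerge {
        // One heap entry: a pending line plus the reader it came from.
        static final class Entry {
            final String line;
            final BufferedReader source;
            Entry(String line, BufferedReader source) { this.line = line; this.source = source; }
        }

        public static void merge(List<BufferedReader> logs, BufferedWriter out,
                                 ToLongFunction<String> timestampOf) throws IOException {
            PriorityQueue<Entry> heap = new PriorityQueue<>(
                    Comparator.comparingLong((Entry e) -> timestampOf.applyAsLong(e.line)));
            for (BufferedReader reader : logs) {                  // seed the heap: one entry per file
                String line = reader.readLine();
                if (line != null) heap.add(new Entry(line, reader));
            }
            String lastWritten = null;
            while (!heap.isEmpty()) {
                Entry min = heap.poll();                          // earliest event across all files
                if (!min.line.equals(lastWritten)) {              // drop duplicate events
                    out.write(min.line);
                    out.newLine();
                    lastWritten = min.line;
                }
                String next = min.source.readLine();              // refill from the same file
                if (next != null) heap.add(new Entry(next, min.source));
            }
            out.flush();
        }
    }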

+1

We needed to chronologically merge several log files with multi-line log records (Java applications do this often - their stack traces span several lines). I decided to implement a simple shell + Perl script. It covers our needs. If you are interested, follow the link: http://code.google.com/p/logmerge/

+1

Read only one line at a time from each of the two source files. Compare the lines and write the earlier one to the output file (then advance to the next line in that file). Do this until you reach the end of both files, and you have merged the files.

And do not forget to remove duplicates :)

I suppose this C# code might illustrate the approach:

    StreamReader fileStream1;
    StreamReader fileStream2;

    Event eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
    Event eventCursorFile2 = Event.Parse(fileStream2.ReadLine());

    while (!(fileStream1.EndOfStream && fileStream2.EndOfStream))
    {
        if (eventCursorFile1.TimeStamp < eventCursorFile2.TimeStamp)
        {
            WriteToMasterFile(eventCursorFile1);
            eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
        }
        else if (eventCursorFile1.TimeStamp == eventCursorFile2.TimeStamp)
        {
            // Same event seen in both files: write it once, advance both
            WriteToMasterFile(eventCursorFile1);
            eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
            eventCursorFile2 = Event.Parse(fileStream2.ReadLine());
        }
        else
        {
            WriteToMasterFile(eventCursorFile2);
            eventCursorFile2 = Event.Parse(fileStream2.ReadLine());
        }
    }

The termination condition is not entirely correct, since this is just quick'n'dirty code, but it should look something like this.

0

Or you could borrow the log merge utility from AWStats, an open source website statistics tool.

logresolvemerge.pl is a Perl script that can merge multiple log files; you can even use multiple threads to merge the log files (you need Perl 5.8 for multi-threaded use). Why not use a readily available tool instead of building one?

0

Source: https://habr.com/ru/post/1277115/

