Comparing HUGE ASCII Files

I work for a company that does ETL work against various databases. I have been tasked with creating a patch between two complete historical data sets on a client machine, which will then be sent to our servers. The patch process must be scriptable so it can be called from our own software.

The data sets are plain text files. We have extraction software running on our client systems to produce them, and the extract files can be 3 GB or more. I have implemented a solution using Microsoft's FC.exe, but it has limitations.

I use FC to create a comparison file and then parse it in Perl on our side to pick out the records that were deleted/updated and those that were added.

FC works fine for me as long as a line of text does not exceed 128 characters. When it does, the overflow is wrapped onto the next line of the comparison file and shows up as a spurious added/deleted record. I know I could probably pre-process the files, but that would add a huge amount of time and possibly defeat the purpose.

I tried using diffutils, but it complains about large files.

I also played with some C# code to implement the patch process myself. It worked fine for small files, but was terribly inefficient on large ones (I tested it against a 2.8 GB extract).

Are there any good command-line utilities or C# libraries I can use to create this patch file? If not, is there an algorithm I can use to implement one myself? Keep in mind that records can be updated, added, and deleted. (I know, it also annoys me that clients DELETE records instead of marking them inactive, but that is out of my control.)

Edit for clarity:

I need to compare two separate database snapshots taken at two different times, usually about a day apart.

Given the files below (they will obviously be much longer and much wider):


Old.txt

 a
 b
 c
 d
 e
 1
 f
 2
 5

New.txt

 a
 3
 b
 c
 4
 d
 e
 1
 f
 g

Expected Result:

 3 added
 4 added
 2 removed
 g added
 5 removed
2 answers

Here's a fairly efficient solution. I think it's roughly O(n), but that depends on the distribution of additions and deletions. Memory consumption is quite low, but it also depends on the number of consecutive additions and deletions.

Limitations:

  • This algorithm does not keep patch lines in the same order as the source file. If that matters, you could use something like a Dictionary<string, int>, where the key is the line and the value is the original line number, instead of a HashSet<string> to track the added and removed lines (see the sketch after this list).
  • The target ("new") file must be somewhat similar to the source ("old") file. In particular, all unchanged lines must appear in the same order in both source and target. If this condition is not met, the algorithm will behave badly.
  • Each line must be unique relative to nearby lines, where "nearby" means between the closest lines that are unchanged between source and target. If this condition is not met, the algorithm will miss changes.
  • This implementation does not handle modified lines. I think you could add that by replacing the == comparison with whatever operation you use to decide that two lines are the "same" record, and then writing them to the patch if they are the "same" record but the content has changed.
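As a rough illustration of the first limitation, here is a minimal sketch (my own, not part of the answer's code) of how the tentative buffers could become Dictionary<string, int> so the flush can write lines in their original order. The class name OrderedFlush and the line-number bookkeeping are assumptions:

using System.Collections.Generic;
using System.IO;
using System.Linq;

// Illustrative sketch only: the tentative buffers become Dictionary<string, int>
// (line text -> original line number) instead of HashSet<string>, so that a
// flush can emit lines in original-file order. Null lines (end of file) would
// need to be skipped before being added, since a Dictionary cannot hold a null key.
public static class OrderedFlush
{
    public static void Flush(StreamWriter writer, IDictionary<string, int> lines, string suffix)
    {
        // Sort buffered lines by the line number recorded when they were buffered.
        foreach (var pair in lines.OrderBy(p => p.Value))
        {
            writer.WriteLine("{0} {1}", pair.Key.Trim(), suffix);
        }
    }
}

// Inside the Diff loop (shown later in this answer) the bookkeeping would change
// from removed.Add(sourceLine) to removed.Add(sourceLine, sourceLineNumber),
// and from removed.Contains(targetLine) to removed.ContainsKey(targetLine),
// with sourceLineNumber / targetLineNumber incremented on every ReadLine call.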

The algorithm uses a pair of "added" and "removed" buffers to track potentially added and removed lines as it walks through the files. Lines are tentatively marked as "added" or "removed" when they do not match between the files. When a tentatively marked line is later found in the other file (a "removed" line turns up in the target, or an "added" line turns up in the source), that is a signal that every line in the other buffer really does belong there, so the other buffer is flushed to the patch file, and the reader is advanced by one line in the file where the matching line was found.

For instance:

 
 Source        Target      Added       Removed
 A ----------- A           _           _
 B ---------\  X           +X          +B
 C --------\ \ B           Flush X     -B
 D          \  C           _           _
 E ----------- E           +E          +D
 F ----------- F           -E          Flush D

Here is the code:

public void Diff(string sourcePath, string targetPath, string patchPath, string addedSuffix, string removedSuffix)
{
    using (var sourceReader = new StreamReader(sourcePath))
    using (var targetReader = new StreamReader(targetPath))
    using (var patchWriter = new StreamWriter(patchPath, append: false))
    {
        var sourceLine = sourceReader.ReadLine();
        var targetLine = targetReader.ReadLine();
        var added = new HashSet<string>();
        var removed = new HashSet<string>();

        do
        {
            if (sourceLine == targetLine)
            {
                sourceLine = sourceReader.ReadLine();
                targetLine = targetReader.ReadLine();
            }
            else
            {
                if (removed.Contains(targetLine))
                {
                    // Found targetLine in tentatively removed lines, so it wasn't actually removed.
                    removed.Remove(targetLine);
                    // Since we found something we thought had been removed, we know that all tentatively added lines actually are new.
                    Flush(patchWriter, added, addedSuffix);
                    added.Clear();
                    targetLine = targetReader.ReadLine();
                }
                else if (added.Contains(sourceLine))
                {
                    // Found sourceLine in tentatively added lines, so it wasn't actually added.
                    added.Remove(sourceLine);
                    // We found something we thought had been added, so all candidates for removal should actually be removed.
                    Flush(patchWriter, removed, removedSuffix);
                    removed.Clear();
                    sourceLine = sourceReader.ReadLine();
                }
                else
                {
                    // Source and target don't match, so we assume that the source was removed and the target was added.
                    // If we're wrong, we'll clean it up when we come across the line later on.
                    removed.Add(sourceLine);
                    added.Add(targetLine);
                    sourceLine = sourceReader.ReadLine();
                    targetLine = targetReader.ReadLine();
                }
            }
        } while (sourceLine != null || targetLine != null);

        Flush(patchWriter, added, addedSuffix);
        Flush(patchWriter, removed, removedSuffix);
    }
}

public void Flush(StreamWriter writer, IEnumerable<string> lines, string suffix)
{
    foreach (var line in lines.Where(l => l != null))
    {
        writer.WriteLine("{0} {1}", line.Trim(), suffix);
    }
}
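For reference, a call might look like the sketch below. The class name FileDiffer, the file paths, and the suffix strings are illustrative assumptions, not part of the answer's code:

// Hypothetical usage of the Diff method above; assumes it lives in a class
// called FileDiffer, and the paths and suffixes are examples only.
var differ = new FileDiffer();
differ.Diff(
    sourcePath:    @"C:\extracts\Old.txt",
    targetPath:    @"C:\extracts\New.txt",
    patchPath:     @"C:\extracts\patch.txt",
    addedSuffix:   "added",
    removedSuffix: "removed");
// For the Old.txt/New.txt example from the question, patch.txt would contain
// lines such as "3 added", "4 added", "g added", "2 removed", "5 removed";
// the exact ordering depends on when each buffer is flushed.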

Here is the code I used to generate the test files:

var path = /* path */;
var sourcePath = Path.Combine(path, "source.txt");
var targetPath = Path.Combine(path, "target.txt");
var expectedPath = Path.Combine(path, "expected.txt");
var rnd = new Random(10);

using (var sourceWriter = new StreamWriter(sourcePath))
using (var targetWriter = new StreamWriter(targetPath))
using (var expectedWriter = new StreamWriter(expectedPath))
{
    var limit = 10.0 * 100000;
    for (int i = 0; i < limit; i++)
    {
        if (i % 10000 == 0)
            Console.Write("{0:P0} ...", i / limit);

        var guid = Guid.NewGuid().ToString();
        var r = rnd.Next(0, 10);
        var removed = 3;
        var added = 6;

        if (r >= 0 && r < removed)
        {
            sourceWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 0");
        }
        else if (r >= removed && r < added)
        {
            targetWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 1");
        }
        else if (r >= added)
        {
            sourceWriter.WriteLine(guid);
            targetWriter.WriteLine(guid);
        }
    }
}
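A possible verification step, not part of the answer's code, would be to run the Diff method on the generated files with "1"/"0" as the suffixes (so the output format matches expected.txt) and compare the results ignoring line order. This reuses the variables from the snippet above and the hypothetical FileDiffer class from the earlier usage sketch:

// Hypothetical check: diff the generated files and compare against expected.txt.
var patchPath = Path.Combine(path, "patch.txt");
new FileDiffer().Diff(sourcePath, targetPath, patchPath, addedSuffix: "1", removedSuffix: "0");

var expected = new HashSet<string>(File.ReadLines(expectedPath));
var actual = new HashSet<string>(File.ReadLines(patchPath));
Console.WriteLine(expected.SetEquals(actual) ? "Patch matches expected output." : "Mismatch!");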

See any errors or problems? Is this what you are looking for?


Well, you are comparing two text files, each of which contains records that are not necessarily in any particular order, and I expect each record has a definite format. If I understand you right, what you really have is something like this, where * represents the beginning of a record and @ represents its end:

 OLD.TXT: * a @ * b @ * c @ etc. ...

The rough "algorithm" would be:

1) Make a copy of NEW and call it ADDED.
2) Get a record from OLD.
3.0) Scan ADDED for that record. If it is there, save the record to a STILLEXISTS file and delete it from ADDED.
3.1) If the record is not in ADDED, save it to a DELETED file, then get the next record from OLD.
4) When you are done, you will have three files: the records that were added (whatever is left in ADDED), the records that were deleted, and the bonus "still there" records, all in one pass. ;)

I hope I understood it correctly and that this helps you. A rough sketch of this approach appears below.
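A minimal sketch of that approach, under two assumptions that are not in the answer: records are one per line instead of using the */@ markers, and the "copy of NEW" is held in memory as a set, which is the simplest version but may not be practical for multi-gigabyte extracts:

using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch only of the single-pass compare described above.
public static class SinglePassCompare
{
    public static void Compare(string oldPath, string newPath,
                               string addedPath, string deletedPath, string stillExistsPath)
    {
        // Step 1: "copy" NEW into a working collection called ADDED.
        var added = new HashSet<string>(File.ReadLines(newPath));

        using (var deletedWriter = new StreamWriter(deletedPath))
        using (var stillExistsWriter = new StreamWriter(stillExistsPath))
        {
            // Steps 2-3: walk OLD and test each record against ADDED.
            foreach (var record in File.ReadLines(oldPath))
            {
                if (added.Remove(record))
                {
                    // 3.0: present in both files, so it still exists.
                    stillExistsWriter.WriteLine(record);
                }
                else
                {
                    // 3.1: only in OLD, so it was deleted.
                    deletedWriter.WriteLine(record);
                }
            }
        }

        // Step 4: whatever is left in ADDED was not in OLD, i.e. it was added.
        File.WriteAllLines(addedPath, added);
    }
}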


Source: https://habr.com/ru/post/949770/

