I work for a company that does ETL work against various databases. I have been tasked with creating a patch between two complete historical data sets on the client machine, which would then be sent to our servers. This patch needs to be generated programmatically so that it can be called from our software.
The data sets are plain text files. We have extraction software running on our client systems that performs the extraction. The extract files can be upwards of 3 GB. I have implemented a solution using Microsoft's FC.exe, but it has limitations.
I use FC to produce a comparison file and then parse it in Perl on our side to pull out the records that were deleted/updated and the records that were added.
FC works fine for me as long as a line of text does not exceed 128 characters. When it does, the overflow is placed on the next line of the comparison file and shows up as an added/deleted record. I know I could probably pre-process the files, but that would add a huge amount of time, possibly defeating the purpose.
I tried using diffutils, but it complains that the files are too large.
I also toyed with some C# code to implement the patch process myself. It worked fine for small files but was terribly inefficient on large ones (I tested it against a 2.8 GB extract).
Are there any good command line utilities or C# libraries that I can use to create this patch file? Failing that, is there an algorithm I can use to implement it myself? Keep in mind that records can be updated, added, and deleted. (I know, it annoys me too that clients DELETE records instead of marking them inactive, but that is outside my control.)
Edit for clarity:
I need to compare two separate database extracts taken at two different points in time, usually about one day apart.
Given the files below (they will obviously be much longer and much wider):
Old.txt
a
b
c
d
e
1
f
2
5

New.txt
a
3
b
c
4
d
e
1
f
g

Expected Result:
3 added
4 added
2 removed
g added
5 removed
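For reference, this is roughly the set-difference approach I have been picturing, as a minimal C# sketch. It assumes one record per line, unique records, and that an update shows up as one removal plus one addition; the file names are placeholders, and it holds every line in memory, which likely will not scale to the 3 GB extracts.

using System;
using System.Collections.Generic;
using System.IO;

class SnapshotDiff
{
    static void Main()
    {
        // Placeholder paths; the real extracts come from our extraction software.
        var oldRecords = new HashSet<string>(File.ReadLines("Old.txt"));
        var newRecords = new HashSet<string>(File.ReadLines("New.txt"));

        // Records present only in the new snapshot were added.
        foreach (var record in newRecords)
            if (!oldRecords.Contains(record))
                Console.WriteLine(record + " added");

        // Records present only in the old snapshot were removed
        // (or are the old half of an update).
        foreach (var record in oldRecords)
            if (!newRecords.Contains(record))
                Console.WriteLine(record + " removed");
    }
}

The output groups all additions before all removals rather than interleaving them as above, but the content is what matters for the patch.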