What is the fastest way to compare two huge CSV files for change?

I think this is a matter of architecture and/or design:

My scenario:

  • I export a huge amount of data from the DB to CSV.
  • I do this regularly.
  • I want to check whether the latest exported CSV file differs from the previous export.

How can I achieve this (without having to loop over and match the files line by line)?

Notes

  • My exporter is a .NET console application.

  • My DB is MS SQL Server (in case that matters).

  • The exporter runs regularly as a scheduled task, launched from a PowerShell script.

3 answers

It looks like you just want to compute a checksum of each CSV file and compare those. Calculate an MD5 checksum for the file:

    using System.IO;
    using System.Security.Cryptography;

    static byte[] ComputeMd5(string filename)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(filename))
        {
            return md5.ComputeHash(stream);
        }
    }
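
A quick sketch of how this might be used to compare the latest export against the previous one (the file names here are placeholders, not anything from the question):

    using System.Linq;

    // Hash both exports and compare the raw hash bytes.
    byte[] previousHash = ComputeMd5("export_previous.csv"); // placeholder name
    byte[] currentHash  = ComputeMd5("export_current.csv");  // placeholder name

    // SequenceEqual (System.Linq) compares the byte arrays element by element.
    bool changed = !previousHash.SequenceEqual(currentHash);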

You could have the database track the time of the last change. Add a trigger to the table, so that whenever any row is added/removed/updated, a "last modified" value is set to the current time. Then you don't need to compare large files at all: your export job can simply query that last-modified time, compare it with the modification time of the previously exported file, and decide whether a new export is needed.
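
A minimal sketch of the export-side check, assuming the trigger maintains a small tracking table; the table and column names are hypothetical, and the timestamp is assumed to be stored in UTC:

    using System;
    using System.Data.SqlClient;
    using System.IO;

    static bool NeedsReExport(string connectionString, string csvPath)
    {
        // Hypothetical tracking table kept up to date by the trigger;
        // assumes it contains at least one row.
        const string query = "SELECT MAX(LastModified) FROM dbo.ExportTracking";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(query, conn))
        {
            conn.Open();
            var lastDbChange = (DateTime)cmd.ExecuteScalar();

            // Re-export only if the data changed after the last file was written.
            return lastDbChange > File.GetLastWriteTimeUtc(csvPath);
        }
    }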


(This assumes you are doing it in PowerShell, but these methods apply to any language.)

I recommend checking the file sizes first. Do this before anything else; it's fast, and a size mismatch proves the files differ without reading either of them!

    # gci is the built-in alias for Get-ChildItem.
    if ((gci $file1).Length -ne (gci $file2).Length) {
        Write-Host "Files are different!"
    } else {
        # Same size, so compare contents...
    }

Finally, you can fall back to a full comparison. In PowerShell, look at Compare-Object (aliased to diff). For instance:

    # gc is the alias for Get-Content; Compare-Object returns the differing
    # lines, so any output at all means the files differ.
    if (diff (gc $file1) (gc $file2)) {
        Write-Host "Files are different!"
    }

A buffered byte-by-byte comparison may be faster still, as shown here: http://keestalkstech.blogspot.com/2010/11/comparing-two-files-in-powershell.html
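
The linked post uses PowerShell, but since the exporter is a .NET console application, here is a rough C# sketch of the same buffered technique (the 64 KB buffer size is an arbitrary choice, not something from the link):

    using System.IO;

    static bool FilesDiffer(string path1, string path2)
    {
        const int BufferSize = 64 * 1024; // 64 KB chunks; tune to taste
        var buffer1 = new byte[BufferSize];
        var buffer2 = new byte[BufferSize];

        using (var stream1 = File.OpenRead(path1))
        using (var stream2 = File.OpenRead(path2))
        {
            if (stream1.Length != stream2.Length)
                return true; // different sizes, nothing to read

            int read1;
            while ((read1 = stream1.Read(buffer1, 0, BufferSize)) > 0)
            {
                // Fill the second buffer with the same number of bytes;
                // safe because the streams are the same length.
                int read2 = 0;
                while (read2 < read1)
                    read2 += stream2.Read(buffer2, read2, read1 - read2);

                // Bail out on the first differing chunk.
                for (int i = 0; i < read1; i++)
                    if (buffer1[i] != buffer2[i])
                        return true;
            }
        }
        return false;
    }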

Alternatives:

MD5 comparisons can actually be slower than byte-by-byte comparisons: not only do you have to read both files, you also have to do the math to compute each hash. You can at least optimize by caching the hash of the old file, which saves half the I/O.
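
A sketch of that caching idea: keep the previous export's hash in a small sidecar file, so each run only has to read and hash the new file (the sidecar-file approach and all names here are illustrative assumptions):

    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static bool ExportChanged(string newCsvPath, string hashCachePath)
    {
        // Hash only the new export; the old file's hash is read from the cache.
        byte[] newHash;
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(newCsvPath))
            newHash = md5.ComputeHash(stream);

        bool changed = !File.Exists(hashCachePath)
            || !newHash.SequenceEqual(File.ReadAllBytes(hashCachePath));

        File.WriteAllBytes(hashCachePath, newHash); // remember for the next run
        return changed;
    }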

The reason this can work for a database export is that most databases append new rows at the end. You have to be sure that is your case, and that rows are only ever added, never updated. If so, you can compare just the tail of the file; for example, the last 4 KB, or some size comfortably larger than your longest row.
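
If that append-only assumption holds, a tail comparison could look roughly like this in C# (the 4 KB default is taken from the suggestion above):

    using System;
    using System.IO;

    static bool TailsDiffer(string path1, string path2, int tailSize = 4096)
    {
        using (var stream1 = File.OpenRead(path1))
        using (var stream2 = File.OpenRead(path2))
        {
            // Appended rows change the length, so a length check comes first.
            if (stream1.Length != stream2.Length)
                return true;

            // Same length: under the append-only assumption, equal tails
            // imply equal files, so only the last tailSize bytes are read.
            long start = Math.Max(0, stream1.Length - tailSize);
            stream1.Seek(start, SeekOrigin.Begin);
            stream2.Seek(start, SeekOrigin.Begin);

            int b1, b2;
            do
            {
                b1 = stream1.ReadByte();
                b2 = stream2.ReadByte();
                if (b1 != b2)
                    return true;
            } while (b1 != -1);
        }
        return false;
    }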


Source: https://habr.com/ru/post/1444808/

