What is the fastest way to compare two huge CSV files for change?

I think this is a matter of architecture and/or design:

My scenario:

  • I export a huge amount of data from the DB to CSV.
  • I do this regularly.
  • I want to check whether the latest exported CSV file differs from the previous export.

How can I achieve this (without having to loop over and match the files line by line)?

Notes

  • My exporter is a .NET console application.

  • My DB is MS SQL Server (in case that matters).

  • The exporter runs regularly as a scheduled task, launched from a PowerShell script.

3 answers

It looks like you just want to compute a checksum of each CSV file and compare those. Calculate an MD5 checksum for the file:

    using System.IO;
    using System.Security.Cryptography;

    static byte[] ComputeMd5(string filename)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(filename))
        {
            return md5.ComputeHash(stream);
        }
    }
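
A quick sketch of how this might be used to compare the latest export against the previous one (the file names here are placeholders, not anything from the question):

    using System.Linq;

    // Hash both exports and compare the raw hash bytes.
    byte[] previousHash = ComputeMd5("export_previous.csv"); // placeholder name
    byte[] currentHash  = ComputeMd5("export_current.csv");  // placeholder name

    // SequenceEqual (System.Linq) compares the byte arrays element by element.
    bool changed = !previousHash.SequenceEqual(currentHash);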

You could have the database track the time of the last change. Add a trigger to the table, so that whenever any row is added/removed/updated, a "last modified" value is set to the current time. Then you don't need to compare large files at all: your export job can simply query that last-modified time, compare it with the modification time of the previously exported file, and decide whether a new export is needed.
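
A minimal sketch of the export-side check, assuming the trigger maintains a small tracking table; the table and column names are hypothetical, and the timestamp is assumed to be stored in UTC:

    using System;
    using System.Data.SqlClient;
    using System.IO;

    static bool NeedsReExport(string connectionString, string csvPath)
    {
        // Hypothetical tracking table kept up to date by the trigger;
        // assumes it contains at least one row.
        const string query = "SELECT MAX(LastModified) FROM dbo.ExportTracking";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(query, conn))
        {
            conn.Open();
            var lastDbChange = (DateTime)cmd.ExecuteScalar();

            // Re-export only if the data changed after the last file was written.
            return lastDbChange > File.GetLastWriteTimeUtc(csvPath);
        }
    }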


(This assumes you are doing it in PowerShell, but these methods apply to any language.)

I recommend checking the file sizes first. Do this before anything else; it's fast, and a size mismatch proves the files differ without reading either of them!

    # gci is the built-in alias for Get-ChildItem.
    if ((gci $file1).Length -ne (gci $file2).Length) {
        Write-Host "Files are different!"
    } else {
        # Same size, so compare contents...
    }

Finally, you can fall back to a full comparison. In PowerShell, look at Compare-Object (aliased to diff). For instance:

    # gc is the alias for Get-Content; Compare-Object returns the differing
    # lines, so any output at all means the files differ.
    if (diff (gc $file1) (gc $file2)) {
        Write-Host "Files are different!"
    }

A buffered byte-by-byte comparison may be faster still, as shown here: http://keestalkstech.blogspot.com/2010/11/comparing-two-files-in-powershell.html
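
The linked post uses PowerShell, but since the exporter is a .NET console application, here is a rough C# sketch of the same buffered technique (the 64 KB buffer size is an arbitrary choice, not something from the link):

    using System.IO;

    static bool FilesDiffer(string path1, string path2)
    {
        const int BufferSize = 64 * 1024; // 64 KB chunks; tune to taste
        var buffer1 = new byte[BufferSize];
        var buffer2 = new byte[BufferSize];

        using (var stream1 = File.OpenRead(path1))
        using (var stream2 = File.OpenRead(path2))
        {
            if (stream1.Length != stream2.Length)
                return true; // different sizes, nothing to read

            int read1;
            while ((read1 = stream1.Read(buffer1, 0, BufferSize)) > 0)
            {
                // Fill the second buffer with the same number of bytes;
                // safe because the streams are the same length.
                int read2 = 0;
                while (read2 < read1)
                    read2 += stream2.Read(buffer2, read2, read1 - read2);

                // Bail out on the first differing chunk.
                for (int i = 0; i < read1; i++)
                    if (buffer1[i] != buffer2[i])
                        return true;
            }
        }
        return false;
    }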

Alternatives:

MD5 comparisons can actually be slower than byte-by-byte comparisons: not only do you have to read both files, you also have to do the math to compute each hash. You can at least optimize by caching the hash of the old file, which saves half the I/O.
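
A sketch of that caching idea: keep the previous export's hash in a small sidecar file, so each run only has to read and hash the new file (the sidecar-file approach and all names here are illustrative assumptions):

    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static bool ExportChanged(string newCsvPath, string hashCachePath)
    {
        // Hash only the new export; the old file's hash is read from the cache.
        byte[] newHash;
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(newCsvPath))
            newHash = md5.ComputeHash(stream);

        bool changed = !File.Exists(hashCachePath)
            || !newHash.SequenceEqual(File.ReadAllBytes(hashCachePath));

        File.WriteAllBytes(hashCachePath, newHash); // remember for the next run
        return changed;
    }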

The reason this can work for a database export is that most databases append new rows at the end. You have to be sure that is your case, and that rows are only ever added, never updated. If so, you can compare just the tail of the file; for example, the last 4 KB, or some size comfortably larger than your longest row.
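
If that append-only assumption holds, a tail comparison could look roughly like this in C# (the 4 KB default is taken from the suggestion above):

    using System;
    using System.IO;

    static bool TailsDiffer(string path1, string path2, int tailSize = 4096)
    {
        using (var stream1 = File.OpenRead(path1))
        using (var stream2 = File.OpenRead(path2))
        {
            // Appended rows change the length, so a length check comes first.
            if (stream1.Length != stream2.Length)
                return true;

            // Same length: under the append-only assumption, equal tails
            // imply equal files, so only the last tailSize bytes are read.
            long start = Math.Max(0, stream1.Length - tailSize);
            stream1.Seek(start, SeekOrigin.Begin);
            stream2.Seek(start, SeekOrigin.Begin);

            int b1, b2;
            do
            {
                b1 = stream1.ReadByte();
                b2 = stream2.ReadByte();
                if (b1 != b2)
                    return true;
            } while (b1 != -1);
        }
        return false;
    }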


Source: https://habr.com/ru/post/1444808/

