Is it possible to improve the checksum algorithm?

We have a very old, unsupported program that copies files across SMB shares. It has a checksum algorithm to determine whether a file's contents have changed before copying. The algorithm seems easily fooled - we just found an example where two files, identical except for a single "1" changing to a "2", return the same checksum. Here's the algorithm:

    unsigned long GetFileCheckSum(CString PathFilename)
    {
        FILE* File;
        unsigned long CheckSum = 0;
        unsigned long Data = 0;
        unsigned long Count = 0;

        if ((File = fopen(PathFilename, "rb")) != NULL) {
            while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE) {
                CheckSum ^= Data + ++Count;
                Data = 0;
            }
            fclose(File);
        }
        return CheckSum;
    }

I am not much of a programmer (I am a system administrator), but I know that an XOR-based checksum is going to be pretty crude. What are the chances that this algorithm will return the same checksum for two files of the same size with different contents? (I am not expecting an exact answer; "remote" or "quite likely" is fine.)

How could it be improved without a huge performance hit?

Finally, what is fread() actually doing? I had a quick scan of the documentation but could not figure it out. Is Data set to each byte of the file in turn? Edit: OK, so it reads the file one unsigned long at a time (let's assume a 32-bit OS here). What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it this (in Perl):

 (ord('a') << 24) & (ord('b') << 16) & (ord('c') << 8) & ord('d') 
+4
8 answers

MD5 is commonly used to verify the integrity of transferred files. C++ source code is readily available. It is widely considered to be a fast and accurate algorithm.

See also Reliable and fast checksum algorithm?
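
As an illustration (not part of the original answer), here is a minimal sketch of hashing a file with OpenSSL's EVP interface, assuming OpenSSL is available:

    #include <stdio.h>
    #include <openssl/evp.h>

    /* Minimal sketch: hash a file with OpenSSL's EVP interface.
       Returns 0 on success, -1 on failure. */
    int md5_file(const char* path, unsigned char out[EVP_MAX_MD_SIZE],
                 unsigned int* out_len)
    {
        FILE* f = fopen(path, "rb");
        if (f == NULL)
            return -1;

        EVP_MD_CTX* ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_md5(), NULL);

        unsigned char buf[64 * 1024];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            EVP_DigestUpdate(ctx, buf, n);      /* feed the file in chunks */

        EVP_DigestFinal_ex(ctx, out, out_len);  /* 16-byte digest for MD5 */
        EVP_MD_CTX_free(ctx);
        fclose(f);
        return 0;
    }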

+6

I would suggest you take a look at the Fletcher checksum, specifically Fletcher-32, which should be quite fast and will detect various things that the current XOR chain would not.
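
For reference, here is a sketch of Fletcher-32 in the usual textbook form (my illustration, not code from the answer); it processes the input as 16-bit words:

    #include <stddef.h>
    #include <stdint.h>

    /* Fletcher-32 over 16-bit words, textbook formulation. */
    uint32_t fletcher32(const uint16_t* data, size_t words)
    {
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (words) {
            /* 359 is the largest block that cannot overflow 32 bits. */
            size_t block = words > 359 ? 359 : words;
            words -= block;
            do {
                sum1 += *data++;
                sum2 += sum1;
            } while (--block);
            sum1 = (sum1 & 0xffff) + (sum1 >> 16);
            sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction step to bring both sums down to 16 bits. */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return (sum2 << 16) | sum1;
    }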

+4

You can easily improve the algorithm using a formula like this:

 Checksum = (Checksum * a + Data * b) + c; 

If a, b, and c are large primes, this should give good results. After that, rotating (not shifting!) the bits of the checksum will improve it a bit more.

Using the primes, this is a similar algorithm to that used in linear congruential generators - it guarantees long periods and good distribution.
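
A minimal sketch of that scheme (the primes a, b, c and the rotation amount below are my illustrative choices, not values from the answer):

    #include <stdint.h>
    #include <stdio.h>

    /* Rotate left: unlike a shift, no bits fall off the end. */
    static uint32_t rotl32(uint32_t x, unsigned r)
    {
        return (x << r) | (x >> (32u - r));
    }

    uint32_t checksum_file(const char* path)
    {
        const uint32_t a = 16777619u;  /* the 32-bit FNV prime */
        const uint32_t b = 65599u;     /* prime used in classic string hashes */
        const uint32_t c = 8191u;      /* 2^13 - 1, a Mersenne prime */

        FILE* f = fopen(path, "rb");
        if (f == NULL)
            return 0;

        uint32_t sum = 0;
        int ch;
        while ((ch = fgetc(f)) != EOF) {
            sum = sum * a + (uint32_t)ch * b + c;
            sum = rotl32(sum, 13);     /* rotate, not shift */
        }
        fclose(f);
        return sum;
    }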

+3

It seems to me that your algorithm makes no effort to handle files that are not an exact multiple of 4 bytes in size. The return value of fread is not a boolean but the number of bytes actually read, which will differ from 4 at EOF or if an error occurs. You check for neither; you simply assume that if it did not return 0, you have 4 valid bytes in Data to compute your hash from.
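
A sketch of a read loop that actually inspects the count, keeping the original variable names (my illustration of the point above):

    size_t Got;
    while ((Got = fread(&Data, 1, sizeof(unsigned long), File)) > 0) {
        /* Got may be less than sizeof(unsigned long) on the final read;
           Data was zeroed beforehand, so the missing bytes stay zero. */
        CheckSum ^= Data + ++Count;
        Data = 0;
    }
    if (ferror(File)) {
        /* a read error occurred - the checksum should not be trusted */
    }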

If you really want to use a hash, I would recommend a few things. First, use a proper cryptographic hash such as MD5, not CRC32. CRC32 is decent for checking the validity of data, but for scanning a file system and ensuring the absence of collisions between files it is not as good a tool, because of the birthday paradox mentioned in the comments elsewhere. Second, don't write the function yourself; find an existing implementation. Finally, consider simply using rsync to replicate the files instead of rolling your own solution.

0

The fread bit reads the file one chunk at a time. Each chunk is the size of a long (in C this is not a fixed size, but you can assume 32 or 64 bits). Depending on how it gets buffered, this may not be bad. OTOH, reading a bigger chunk into an array and looping over it may be much faster.
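
To illustrate the buffered approach, here is a sketch that reads 64 KB at a time and folds 4 bytes into a word the way fread would on a little-endian machine with a 32-bit unsigned long (both assumptions on my part):

    #include <stdio.h>

    /* Buffered variant of the original checksum loop. */
    unsigned long checksum_buffered(FILE* File)
    {
        unsigned char buf[64 * 1024];
        unsigned long CheckSum = 0, Count = 0;
        size_t n;

        while ((n = fread(buf, 1, sizeof buf, File)) > 0) {
            for (size_t i = 0; i < n; i += 4) {
                unsigned long Data = 0;
                /* Little-endian order; a short tail is zero-padded,
                   matching the original code's behavior. */
                for (size_t j = 0; j < 4 && i + j < n; j++)
                    Data |= (unsigned long)buf[i + j] << (8 * j);
                CheckSum ^= Data + ++Count;
            }
        }
        return CheckSum;
    }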

0

Even “expensive” cryptographic hash functions usually require multiple iterations before they take a significant amount of time. Although they are no longer recommended for cryptographic purposes, where users would deliberately try to create collisions, functions like SHA1 and MD5 are widely available and suitable for this purpose.

If a smaller hash value is needed, a CRC is all right, but not great. An n-bit CRC will fail to detect a small fraction of changes that are longer than n bits. For example, suppose a single dollar amount in a file changed from $12,345 to $34,567. A 32-bit CRC might miss that change.

Truncating the result of a longer cryptographic hash will detect changes more reliably than a CRC.
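
As a sketch of that idea (assuming OpenSSL is available; not code from the answer), the first four bytes of a SHA-256 digest can serve as a 32-bit checksum:

    #include <stdint.h>
    #include <string.h>
    #include <openssl/sha.h>

    /* Sketch: truncate a SHA-256 digest to 32 bits. */
    uint32_t truncated_sha256(const unsigned char* data, size_t len)
    {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        SHA256(data, len, digest);        /* classic one-shot API */

        uint32_t out;
        memcpy(&out, digest, sizeof out); /* any four digest bytes would do */
        return out;
    }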

0
 { CheckSum ^= Data + ++Count; Data = 0; } 

I don’t think “++Count” does much work. The code is equivalent to

 { CheckSum ^= Data; } 

XORing a sequence of bytes is not enough, especially with text files. I suggest using a hash function.
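
A small demonstration of the weakness of a plain XOR fold (my example, not from the answer): swapping two words leaves the result unchanged, so reordered data goes undetected:

    #include <stdio.h>

    int main(void)
    {
        unsigned long a[] = { 0x61626364UL, 0x31323334UL };  /* "abcd", "1234" */
        unsigned long b[] = { 0x31323334UL, 0x61626364UL };  /* same words, swapped */
        unsigned long sa = 0, sb = 0;
        for (int i = 0; i < 2; i++) { sa ^= a[i]; sb ^= b[i]; }
        printf("%lx %lx\n", sa, sb);  /* prints the same value twice */
        return 0;
    }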

0

SHA-1 and (more recently) SHA-2 provide excellent hash functions, and I believe they are slowly superseding MD5 thanks to better hashing properties. All of them (md2, sha, etc.) have efficient implementations and return a hash of a buffer several characters long (always a fixed length), which is demonstrably more reliable than reducing a hash to an integer. If I had my druthers, I would use SHA-2. Follow this link for libraries that implement SHA checksums.

If you do not want to compile these libraries in, Linux (and probably Cygwin) has the following executables: md5sum, sha1sum, sha224sum, sha256sum, sha384sum, sha512sum; you can feed any of them your file and they will print the checksum as a hexadecimal string. You can use popen to run those programs - with something like this:

    const int maxBuf = 1024;
    char buf[maxBuf];
    FILE* f = popen("sha224sum myfile", "r");  /* "r": read the tool's output */
    int bytesRead = fread(buf, 1, maxBuf, f);
    pclose(f);                                 /* streams from popen need pclose */

Obviously this will run quite a bit slower, but it makes for a useful first pass. If speed is a concern, then given that file hashing operations like this are I/O bound (memory and disk access will be the bottlenecks), I would expect all of these algorithms to run about as fast as one that produces an unsigned int. Perl and Python also come with implementations of MD5, SHA1 and SHA2, and will probably run as fast as in C/C++.

0

Source: https://habr.com/ru/post/1286764/

