We have a very old, unsupported program that copies files across SMB shares. It has a checksum algorithm to decide whether a file's contents have changed before copying. The algorithm seems easily fooled: we just found an example where two files, identical except for a single "1" changing to a "2", return the same checksum. Here's the algorithm:
unsigned long GetFileCheckSum(CString PathFilename)
{
    FILE* File;
    unsigned long CheckSum = 0;
    unsigned long Data = 0;
    unsigned long Count = 0;

    if ((File = fopen(PathFilename, "rb")) != NULL) {
        while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE) {
            CheckSum ^= Data + ++Count;
            Data = 0;
        }
        fclose(File);
    }
    return CheckSum;
}
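For reference, here is a small self-contained harness (my own sketch, not part of the original program) that applies the same per-word logic to an in-memory buffer, so two near-identical inputs can be compared without copying files around. It assumes the platform's word size and the same zero-padding of a short final chunk that the fread() loop above produces:

#include <stdio.h>
#include <string.h>

static unsigned long BufferCheckSum(const unsigned char *Buf, size_t Len)
{
    unsigned long CheckSum = 0;
    unsigned long Count = 0;
    size_t Pos = 0;

    while (Pos < Len) {
        unsigned long Data = 0;            /* zero-filled, like the original */
        size_t Chunk = Len - Pos;
        if (Chunk > sizeof(unsigned long))
            Chunk = sizeof(unsigned long);
        memcpy(&Data, Buf + Pos, Chunk);   /* stands in for fread() */
        CheckSum ^= Data + ++Count;
        Pos += Chunk;
    }
    return CheckSum;
}

int main(void)
{
    const char *A = "backup set 1, disk 1";   /* hypothetical sample inputs */
    const char *B = "backup set 2, disk 1";

    printf("A: %#lx\nB: %#lx\n",
           BufferCheckSum((const unsigned char *)A, strlen(A)),
           BufferCheckSum((const unsigned char *)B, strlen(B)));
    return 0;
}

Compiling and running this prints the two checksums, which makes it easy to experiment with collision pairs like the 1-versus-2 example above.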
I am not much of a programmer (I am a system administrator), but I know that an XOR-based checksum is going to be pretty crude. What are the chances that this algorithm will return the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer; "remote" or "quite likely" is fine.)
How could it be improved without a huge performance hit?
Lastly, what is going on with fread()? I had a quick scan of the documentation, but I couldn't figure it out. Is Data set to each byte of the file in turn? Edit: OK, so it's reading the file one unsigned long at a time (let's assume a 32-bit OS here). What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):
(ord('a') << 24) | (ord('b') << 16) | (ord('c') << 8) | ord('d')
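For what it's worth, a tiny test program (my own sketch, not from the original program; test.bin is a hypothetical scratch file) suggests the answer depends on byte order. On a little-endian x86 machine, 'a' lands in the low byte of Data, the reverse of the big-endian packing in the Perl expression above:

#include <stdio.h>

int main(void)
{
    unsigned long Data = 0;
    FILE *File;

    File = fopen("test.bin", "wb");
    if (File == NULL) return 1;
    fwrite("abcd", 1, 4, File);            /* first four bytes: a b c d */
    fclose(File);

    File = fopen("test.bin", "rb");
    if (File == NULL) return 1;
    if (fread(&Data, 1, sizeof(unsigned long), File) > 0) {
        /* On a little-endian platform this prints 0x64636261, i.e.
           ('d' << 24) | ('c' << 16) | ('b' << 8) | 'a'
           -- 'a' ends up in the LOW byte, not the high one. */
        printf("Data = %#lx\n", Data);
    }
    fclose(File);
    return 0;
}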