Which checksum algorithm should I use?

I am creating a system that should be able to find if blobs of bytes are updated. Instead of storing the entire block (they can be up to 5 MB), I think I should calculate the checksum, save this and calculate the same checksum a little later to see if the blog is updated.

The goal is to minimize the following (in that order):

  • checksum size
  • time to calculate
  • chance of collisions (2 identical checksums occur even if the content has been modified).

It is acceptable for our system to have a collision of no more than 1/1 000 000. The problem is not security, but simply when updating / detecting errors, so rare collisions are in order. (That's why I tried to minimize).

In addition, we cannot change text frames ourselves.

Of course, md5 , crc or sha1 come to mind, and if I wanted a quick solution, I would go for it. However, more than a quick solution, I'm looking for what can be a comparison of different methods, as well as the pros and cons.

+44
crc md5 sha1 checksum
Nov 20 '10 at 2:09 p.m.
source share
2 answers

I suggest you take a look at this SO , CRC, and MD5 / SHA1 page .
Speed โ€‹โ€‹and collisions are discussed in this other topic .
And as always, Wikipedia is your friend.

If I had to choose, there is an important question: do you want to avoid collisions in any case - or at least the probability was so low that it is close to the probability that the Moon collides with the Earth in the next 5 minutes ?

If yes, select the SHA family.
In your case, I would change the way you check for updates.
For example, an incremental number can be associated with a blob and sent instead of a hash, an update request will be required if the number is different on the other hand. The collision probability in this case goes from ~ 10 ^ -18 to ~ 0 (basically 0 + probability of error) ...

Change the following comments

Found this algorithm, Alder-32, which is good for long messages (MB) with a 32-bit CRC, i.e. around ~ 1/10 ^ 9 (MD5 is 128 bits long).
It is quickly calculated.
Adler-32 . At the bottom there is a sample (link).

+23
Nov 20 '10 at 14:25
source share

Blake2 is the fastest hash function that you can use, and which is mainly used:

BLAKE2 is not only faster than other good hash functions, it is even faster than MD5 or SHA-1 Source

The winner of the SHA-3 contest was the Keccak algorithm, but the still popular implementation is not used by default in GNU / Linux distributions. Instead, Blake2, which was a candidate for the SHA-3 contest, is faster than Keccak and is part of GNU coreutils . So, on your GNU / Linux distribution, you can use b2sum to use the Blake2 hash algorithm.

0
May 21 '17 at 18:05
source share



All Articles