Python file checksums

I am creating a file-related application and was looking for ways to calculate checksums for files. I want to know which hashing method is best for computing file checksums (MD5, SHA-1, or something else) based on these criteria:

  • The checksum must be effectively unique. I know collisions are always theoretically possible, but I still want the collision probability to be very small.
  • Two files can be treated as equal if and only if their checksums are equal.
  • Speed (not very important, but it still matters).

Please feel free to elaborate as much as possible.

+4
4 answers

It depends on your use case.

If you are only concerned about accidental collisions, both MD5 and SHA-1 are fine, and MD5 is generally faster. In fact, MD4 is also sufficient for most use cases and is usually faster still ... but it is not as widely implemented. (In particular, it is not in hashlib.algorithms_guaranteed ... although it should be in hashlib.algorithms_available on most builds for Mac, Windows, and Linux.)
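
For instance, you can check at runtime which algorithms your hashlib build actually supports and fall back if MD4 is missing. A minimal sketch (the fallback choice here is just an illustration):

    import hashlib

    print(sorted(hashlib.algorithms_guaranteed))  # always available in any build
    print(sorted(hashlib.algorithms_available))   # whatever this build's OpenSSL provides

    # Prefer MD4 when it is compiled in, otherwise fall back to MD5.
    name = 'md4' if 'md4' in hashlib.algorithms_available else 'md5'
    h = hashlib.new(name)
    h.update(b'file contents here')
    print(name, h.hexdigest())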

On the other hand, if you are concerned about deliberate attacks, that is, someone intentionally creating a bogus file that matches your hash, you have to consider the value of what you are protecting. MD4 is almost certainly not sufficient, MD5 is probably not sufficient, and SHA-1 is borderline. Keccak (which will soon be SHA-3) is currently believed to be the best choice, but you will want to stay on top of this, because things change every year.
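
As a side note, in modern Python (3.6 and later) the Keccak-based SHA-3 family ships with hashlib itself, so if you end up wanting it, it is a one-liner; a small sketch:

    import hashlib

    # SHA3-256 digest of some in-memory data (stream larger files in chunks instead).
    digest = hashlib.sha3_256(b'contents of the file').hexdigest()
    print(digest)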

The Wikipedia page on Cryptographic hash functions has a table that is usually updated fairly often. To summarize the table:

Producing a collision takes only about 3 rounds of work for MD4, about 2 million for MD5, and about 15 trillion for SHA-1. That is enough that engineering a collision is estimated to cost several million dollars at today's prices. It may or may not be good enough for you, but it is not good enough for NIST.


Also, remember that "generally faster" is not as important as "measured faster on my data and my platform". With that in mind, in 64-bit Python 3.3.0 on my Mac, I created a 1 MB random bytes object and then did the following:

    In [173]: md4 = hashlib.new('md4')
    In [174]: md5 = hashlib.new('md5')
    In [175]: sha1 = hashlib.new('sha1')

    In [180]: %timeit md4.update(data)
    1000 loops, best of 3: 1.54 ms per loop

    In [181]: %timeit md5.update(data)
    100 loops, best of 3: 2.52 ms per loop

    In [182]: %timeit sha1.update(data)
    100 loops, best of 3: 2.94 ms per loop

As you can see, md4 is noticeably faster than the others.

Testing with hashlib.md5() instead of hashlib.new('md5'), and with lower-entropy bytes (runs of 1-8 characters from string.ascii_letters, separated by spaces), did not show any significant differences.

And none of the other hash algorithms that came with my installation, as tested with the loop below, beat md4.

    import hashlib
    import timeit

    # data is the same 1 MB random bytes object used above
    for x in hashlib.algorithms_available:
        h = hashlib.new(x)
        print(x, timeit.timeit(lambda: h.update(data), number=100))

If speed really matters, there is a nice trick you can use to improve on this: use a poor but very fast hash function, such as zlib.adler32, and apply it only to the first 256 KB of each file. (For some file types, the last 256 KB, or the 256 KB nearest the middle without spanning it, etc., may work better than the first.) Then, if you find a collision, compute MD4/SHA-1/Keccak/whatever hashes over the whole file for each of those files.
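
A minimal sketch of that two-stage idea follows; the helper names, the 256 KB prefix, and the MD5 fallback are illustrative assumptions, not a fixed recipe:

    import hashlib
    import zlib

    PREFIX = 256 * 1024  # the cheap signature covers only the first 256 KB

    def quick_signature(path):
        # Adler-32 of the first 256 KB: weak, but very fast.
        with open(path, 'rb') as f:
            return zlib.adler32(f.read(PREFIX))

    def full_digest(path, algorithm='md5', bufsize=1 << 20):
        # Full-file hash, computed only when cheap signatures collide.
        h = hashlib.new(algorithm)
        with open(path, 'rb') as f:
            while True:
                block = f.read(bufsize)
                if not block:
                    break
                h.update(block)
        return h.digest()

    def duplicate_groups(paths):
        by_quick = {}
        for p in paths:
            by_quick.setdefault(quick_signature(p), []).append(p)
        by_full = {}
        for candidates in by_quick.values():
            if len(candidates) < 2:
                continue  # unique cheap signature: no full hash needed
            for p in candidates:
                by_full.setdefault(full_digest(p), []).append(p)
        return [group for group in by_full.values() if len(group) > 1]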


Finally, since someone asked in a comment how to hash a file without reading the whole thing into memory:

    import hashlib

    def hash_file(path, algorithm='md5', bufsize=8192):
        h = hashlib.new(algorithm)
        with open(path, 'rb') as f:
            while True:
                block = f.read(bufsize)
                if not block:
                    break
                h.update(block)
        return h.digest()

If you are trying to squeeze out every last bit of performance, you will want to experiment with different values of bufsize on your platform (anywhere up to 2-4 MB). You can also experiment with raw file descriptors (os.open and os.read), which can sometimes be faster on some platforms.
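
Here is a minimal sketch of that raw-descriptor variant, assuming your platform actually benefits from it (the function name is made up for illustration):

    import hashlib
    import os

    def hash_file_raw(path, algorithm='md5', bufsize=1 << 20):
        # Same loop as hash_file above, but over an OS-level file descriptor.
        flags = os.O_RDONLY | getattr(os, 'O_BINARY', 0)  # O_BINARY only exists on Windows
        h = hashlib.new(algorithm)
        fd = os.open(path, flags)
        try:
            while True:
                block = os.read(fd, bufsize)
                if not block:
                    break
                h.update(block)
        finally:
            os.close(fd)
        return h.digest()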

+5
source

The potential for collisions with a hash of a sufficient number of bits is theoretically quite small:

Assuming random hash values with a uniform distribution, a hash function that generates b bits, and a set of n different data blocks, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair collides, i.e.

p ≤ n(n − 1)/2 × 2^(−b)

So far, no collisions have been observed for SHA-1's 160-bit hashes. Assuming one exabyte (10^18 bytes) of data in 8 KB blocks, the theoretical probability of a collision is about 10^-20, a very small chance.
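
As a quick sanity check of that figure, the bound above can be evaluated directly (a sketch using the numbers quoted in this answer):

    n = 10**18 // (8 * 1024)   # number of 8 KB blocks in one exabyte of data
    b = 160                    # SHA-1 digest size in bits

    p = n * (n - 1) / 2 / 2**b
    print(p)                   # ~5e-21, i.e. on the order of 10**-20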

A useful speed-up is to short-circuit: rule out files that are known to differ from each other without reading them in full.

For example, in outline (see the sketch after this list):

  • Read the first X blocks of all files of interest;
  • Treat the files that share the same hash over the first X blocks as potentially having the same file data;
  • Any file whose first X blocks hash to a unique value can be assumed to be unique relative to all the other files checked, so you do not need to read the rest of that file;
  • For the remaining files, read more blocks until you are sure the signatures are the same or different.

With X blocks of a sufficient size, 95%+ of files will be correctly recognized as unique in the first pass. This is much faster than blindly reading every file in full and computing a full hash for each one.
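
Here is a minimal sketch of that first pass, assuming MD5 over an illustrative block size and X value (subsequent passes would repeat the same idea with more blocks):

    import hashlib

    BLOCK_SIZE = 8192
    X = 4  # how many leading blocks the first pass hashes

    def prefix_digest(path, blocks=X, block_size=BLOCK_SIZE):
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for _ in range(blocks):
                chunk = f.read(block_size)
                if not chunk:
                    break
                h.update(chunk)
        return h.digest()

    def first_pass(paths):
        # Returns (files already known to be unique, groups that need more reading).
        groups = {}
        for p in paths:
            groups.setdefault(prefix_digest(p), []).append(p)
        unique = [g[0] for g in groups.values() if len(g) == 1]
        need_more_reading = [g for g in groups.values() if len(g) > 1]
        return unique, need_more_reading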

+2

md5 tends to work fine for checksums ... same with SHA-1 ... both have a very small chance of collisions, although SHA-1 has a slightly smaller chance since it uses more bits

if you are really worried about it, you could use both checksums (one md5 and one sha1); the chance that both match while the files differ is infinitesimally small (still not 100% impossible, but very, very unlikely) ... (this feels like bad form, though, and is by far the slowest solution)
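
if you do go that route, both digests can be fed from a single pass over the file; a small sketch (the function name is just an illustration):

    import hashlib

    def dual_checksum(path, bufsize=1 << 20):
        md5 = hashlib.md5()
        sha1 = hashlib.sha1()
        with open(path, 'rb') as f:
            while True:
                block = f.read(bufsize)
                if not block:
                    break
                md5.update(block)   # each chunk feeds both hashes,
                sha1.update(block)  # so the file is only read once
        return md5.hexdigest(), sha1.hexdigest()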

usually (read: in every case I have ever come across) a match on MD5 OR SHA-1 is sufficient to assume uniqueness

nothing short of a byte-by-byte comparison can guarantee 100% uniqueness
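
if you ever do need that guarantee, the standard library can do the byte-by-byte comparison for you; a small sketch (the file names are placeholders):

    import filecmp

    # shallow=False forces an actual content comparison rather than a stat() check
    identical = filecmp.cmp('a.bin', 'b.bin', shallow=False)
    print(identical)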

+1

A few days ago I created a small duplicate-file removal script that reads the contents of each file and computes a hash for it, then compares it against the next files: even if the names differ, the checksums will be the same when the contents are the same.

    import hashlib
    import os

    hash_table = {}
    dups = []
    path = "C:\\images"

    for img in os.listdir(path):  # os.path has no listdir; os.listdir is the right call
        img_path = os.path.join(path, img)
        with open(img_path, "rb") as _file:
            content = _file.read()
        md5 = hashlib.md5(content)
        _hash = md5.hexdigest()
        if _hash in hash_table:
            dups.append(img)
        else:
            hash_table[_hash] = img
0

Source: https://habr.com/ru/post/1483210/

