It depends on your use case.
If you are only concerned about accidental collisions, both MD5 and SHA-1 are fine, and MD5 is generally faster. In fact, MD4 is also sufficient for most use cases, and usually even faster ... but it is not as widely implemented. (In particular, it is not in hashlib.algorithms_guaranteed ... although it should be in hashlib.algorithms_available on most Mac, Windows, and Linux builds.)
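For example, a minimal availability check (just a sketch; the fallback choice is up to you):

import hashlib

# md4 is not guaranteed, so check what this build's OpenSSL actually provides
if 'md4' in hashlib.algorithms_available:
    h = hashlib.new('md4')
else:
    h = hashlib.md5()  # fall back to a guaranteed algorithm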
On the other hand, if you are worried about deliberate attacks, that is, someone intentionally crafting a file that matches your hash, you have to consider the value of what you are protecting. MD4 is almost certainly not sufficient, MD5 is probably not sufficient, and SHA-1 is borderline. At present, Keccak (which will soon be SHA-3) is believed to be the best bet, but you will want to stay on top of this, because things change every year.
The Wikipedia page on Cryptographic Hash Function has a table that is updated fairly often. To understand the table:
Generating a collision against MD4 requires only 3 rounds, while MD5 requires about 2 million, and SHA-1 about 15 trillion. That last figure is enough that it would cost a few million dollars (at current prices) to generate a collision. That may or may not be good enough for you, but it is not good enough for NIST.
Also, remember that “generally faster” is nowhere near as important as “tested faster on my data and my platform”. With that in mind, in 64-bit Python 3.3.0 on my Mac, I created a 1 MB random bytes object and did this:
In [173]: md4 = hashlib.new('md4')
In [174]: md5 = hashlib.new('md5')
In [175]: sha1 = hashlib.new('sha1')

In [180]: %timeit md4.update(data)
1000 loops, best of 3: 1.54 ms per loop

In [181]: %timeit md5.update(data)
100 loops, best of 3: 2.52 ms per loop

In [182]: %timeit sha1.update(data)
100 loops, best of 3: 2.94 ms per loop
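If you want to reproduce this outside IPython, the setup is just a few lines (timings will of course differ on your data and machine):

import hashlib
import os
import timeit

data = os.urandom(2**20)  # 1 MB of random bytes

md4 = hashlib.new('md4')
print(timeit.timeit(lambda: md4.update(data), number=1000))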
As you can see, md4 is significantly faster than the others.
Testing with hashlib.md5() instead of hashlib.new('md5'), and with lower-entropy bytes (runs of 1-8 characters from string.ascii_letters, separated by spaces), did not show any significant differences.
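For reference, one way to build that kind of lower-entropy test data (this generator is my own illustration, not necessarily the exact one benchmarked):

import random
import string

# runs of 1-8 ASCII letters separated by spaces, roughly 1 MB total
chunks = []
size = 0
while size < 2**20:
    word = ''.join(random.choice(string.ascii_letters)
                   for _ in range(random.randint(1, 8)))
    chunks.append(word)
    size += len(word) + 1
data = ' '.join(chunks).encode('ascii')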
And none of the hash algorithms that came with my installation beat md4, as tested below:
for x in hashlib.algorithms_available:
    h = hashlib.new(x)
    print(x, timeit.timeit(lambda: h.update(data), number=100))
If speed is really important, there is a good trick you can use to improve on this: use a bad but very fast hash function, like zlib.adler32, and apply it only to the first 256 KB of each file. (For some file types, the last 256 KB, or the 256 KB nearest the middle, etc., may be better than the first.) Then, if you find a collision, generate MD4/SHA-1/Keccak/whatever hashes of the whole file for each candidate, as sketched below.
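Here is a rough sketch of that two-stage idea for deduplicating a list of paths. The names quick_hash and find_duplicates are made up for illustration, and it reuses the hash_file function defined at the end of this answer:

import zlib
from collections import defaultdict

def quick_hash(path, size=256 * 1024):
    # cheap first pass: adler32 over just the first 256 KB
    with open(path, 'rb') as f:
        return zlib.adler32(f.read(size))

def find_duplicates(paths):
    # bucket files by the cheap hash
    buckets = defaultdict(list)
    for path in paths:
        buckets[quick_hash(path)].append(path)
    # only pay for a full-file hash inside colliding buckets
    dupes = defaultdict(list)
    for group in buckets.values():
        if len(group) > 1:
            for path in group:
                dupes[hash_file(path, 'md5')].append(path)
    return {k: v for k, v in dupes.items() if len(v) > 1}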
Finally, since someone asked in a comment how to hash a file without reading the whole thing into memory:
import hashlib

def hash_file(path, algorithm='md5', bufsize=8192):
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()
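For example (the file names here are just placeholders):

# compare two files without ever holding more than bufsize bytes in memory
if hash_file('a.bin') == hash_file('b.bin'):
    print('almost certainly identical contents')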
If squeezing out every bit of performance matters, you will want to experiment with different values for bufsize on your platform (powers of 2, from a few KB up to a few MB). You can also experiment with raw file descriptors (os.open and os.read), which can sometimes be faster on some platforms.
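A rough, untested sketch of that raw-file-descriptor variant (hash_file_raw is a made-up name; measure before adopting it):

import hashlib
import os

def hash_file_raw(path, algorithm='md5', bufsize=2**20):
    # same read loop as hash_file above, but via os.open/os.read
    h = hashlib.new(algorithm)
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            block = os.read(fd, bufsize)
            if not block:
                break
            h.update(block)
    finally:
        os.close(fd)
    return h.digest()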