Maximum buffer size when updating a hash with Python's hashlib module

I am trying to calculate the MD5 hash of a file with the hashlib.md5() function from the hashlib module.

So, I typed this piece of code:

    Buffer = 128
    f = open("c:\\file.tct", "rb")
    m = hashlib.md5()
    while True:
        p = f.read(Buffer)
        if len(p) != 0:
            m.update(p)
        else:
            break
    print m.hexdigest()
    f.close()

I noticed that the update runs faster as I increase the value of the Buffer variable to 64, 128, 256, and so on. Is there an upper limit I cannot exceed? I suspect it may only be limited by available RAM, but I don't know.
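Whatever buffer size is chosen, the resulting digest is the same; only the speed changes. A minimal sketch demonstrating this, using an in-memory io.BytesIO as a stand-in for a real file (the helper name md5_chunked is mine, not from the question):

```python
import hashlib
import io

def md5_chunked(fileobj, buffer_size):
    """Hash a file-like object in buffer_size-byte reads."""
    m = hashlib.md5()
    while True:
        chunk = fileobj.read(buffer_size)
        if not chunk:  # empty bytes means end of file
            break
        m.update(chunk)
    return m.hexdigest()

data = b"x" * 100_000  # stand-in for the file contents
digests = {md5_chunked(io.BytesIO(data), size) for size in (64, 128, 256, 4096)}
print(digests)  # the set collapses to a single digest: all buffer sizes agree
```

Because feeding the data in pieces via update() is equivalent to hashing it all at once, the buffer size is purely a performance knob.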

+2
python
Feb 09 '11 at 18:53
3 answers

Large (≈ 2**40) block sizes lead to a MemoryError, i.e., there is no limit other than the available RAM. bufsize, on the other hand, is limited to 2**31-1 on my machine:

    import hashlib
    from functools import partial

    def md5(filename, chunksize=2**15, bufsize=-1):
        m = hashlib.md5()
        with open(filename, 'rb', bufsize) as f:
            for chunk in iter(partial(f.read, chunksize), b''):
                m.update(chunk)
        return m

A large chunksize can be as slow as a very small one. Measure it.

For files of about 10 MB, I find that a chunksize of 2**15 is the fastest for the files I tested.
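"Measure it" can be sketched with a small timing harness. This is an assumption of how one might benchmark chunk sizes, using timeit and an in-memory buffer rather than a real on-disk file (disk caching would otherwise dominate the first run):

```python
import hashlib
import io
import timeit

data = b"\x00" * (1 << 20)  # 1 MiB of data as a stand-in for a real file

def md5_with_chunksize(chunksize):
    """Hash the test data, reading it chunksize bytes at a time."""
    m = hashlib.md5()
    f = io.BytesIO(data)
    for chunk in iter(lambda: f.read(chunksize), b""):
        m.update(chunk)
    return m.hexdigest()

# Time a few chunk sizes; the sweet spot depends on your machine.
for exp in (6, 10, 15, 20):
    t = timeit.timeit(lambda: md5_with_chunksize(2 ** exp), number=5)
    print(f"chunksize=2**{exp}: {t:.4f}s")
```

On a real workload you would substitute the open file for the BytesIO and average several runs; the absolute numbers are machine-dependent, which is exactly why the answer says to measure rather than assume.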

+3
Feb 11 '11

To be able to process arbitrarily large files, you need to read them in blocks. The block size should preferably be a power of 2, and in the case of MD5 the smallest meaningful block is 64 bytes (512 bits), since 512-bit blocks are the units the algorithm operates on.
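The 64-byte (512-bit) internal block size is exposed directly by hashlib, so it can be checked rather than taken on faith:

```python
import hashlib

m = hashlib.md5()
print(m.block_size)   # internal block size in bytes: 64 (512 bits)
print(m.digest_size)  # output digest size in bytes: 16 (128 bits)
```

Reading in multiples of block_size means update() never has to carry a partial internal block between calls, which is why power-of-2 buffer sizes at or above 64 bytes are a natural choice.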

But if we go beyond this and try to establish an exact criterion, say whether a 2048-byte block is better than a 4096-byte one, we are likely to fail. This has to be checked and measured carefully, and in practice the value is almost always chosen somewhat arbitrarily, based on experience.

+2
Feb 09 '11 at 19:04

The buffer value is the number of bytes read and held in memory at once, so yes, the only limit is your available memory.

However, larger values will not automatically be faster. At some point you may run into memory swapping or other memory-allocation slowdowns if the buffer is too large. You should experiment with larger and larger values until you hit diminishing returns in speed.

0
Feb 09 '11 at 19:01


