I like the accepted answer: it is simple and will do the job. I would also suggest an alternative implementation:
```python
def chunks(filename, buffer_size=4096):
    """Reads `filename` in chunks of `buffer_size` bytes and yields each chunk
    until no more characters can be read; the last chunk will most likely have
    less than `buffer_size` bytes.

    :param str filename: Path to the file
    :param int buffer_size: Buffer size, in bytes (default is 4096)
    :return: Yields chunks of `buffer_size` size until exhausting the file
    :rtype: str

    """
    with open(filename, "rb") as fp:
        chunk = fp.read(buffer_size)
        while chunk:
            yield chunk
            chunk = fp.read(buffer_size)


def chars(filename, buffersize=4096):
    """Yields the contents of file `filename` character-by-character. Warning:
    will only work for encodings where one character is encoded as one byte.

    :param str filename: Path to the file
    :param int buffersize: Buffer size for the underlying chunks, in bytes
        (default is 4096)
    :return: Yields the contents of `filename` character-by-character.
    :rtype: char

    """
    for chunk in chunks(filename, buffersize):
        for char in chunk:
            yield char


def main(buffersize, filenames):
    """Reads several files character by character and redirects their contents
    to `/dev/null`.

    """
    for filename in filenames:
        with open("/dev/null", "wb") as fp:
            for char in chars(filename, buffersize):
                fp.write(char)


if __name__ == "__main__":
    # Usage: python so.py <buffersize> <filename> [<filename> ...]
    import sys
    buffersize = int(sys.argv[1])
    filenames = sys.argv[2:]
    sys.exit(main(buffersize, filenames))
```
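A side note: the code above is written for Python 2, where iterating over the result of a binary `read` yields one-character strings. Under Python 3, iterating over a `bytes` object yields integers, so `fp.write(char)` in `main` would fail. A minimal sketch of a Python 3 friendly variant of `chars` (the name `chars_py3` and the slicing trick are my adaptation, not part of the original answer):

```python
def chars_py3(filename, buffersize=4096):
    """Yield the contents of `filename` one byte at a time, as 1-byte
    `bytes` objects, so each item can be passed straight to `fp.write`."""
    with open(filename, "rb") as fp:
        chunk = fp.read(buffersize)
        while chunk:
            # Slicing a bytes object preserves the bytes type; plain
            # iteration (`for char in chunk`) would yield ints in Python 3.
            for i in range(len(chunk)):
                yield chunk[i:i + 1]
            chunk = fp.read(buffersize)
```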
The code I suggest is essentially the same idea as your accepted answer: read a given number of bytes from the file. The difference is that it first reads a good-sized chunk of data (4096 bytes is a good default on x86, but you may want to try 1024 or 8192; any multiple of your page size), and then yields the characters in that chunk one by one.
The code I present may be faster for large files. Take, for example, the file 2600.txt.utf-8 used in the timings below. These are my preliminary results (MacBook Pro running OS X 10.7.4; so.py is the name I gave to the code above):
```
$ time python so.py 1 2600.txt.utf-8
python so.py 1 2600.txt.utf-8  3.79s user 0.01s system 99% cpu 3.808 total
$ time python so.py 4096 2600.txt.utf-8
python so.py 4096 2600.txt.utf-8  1.31s user 0.01s system 99% cpu 1.318 total
```
Now: do not accept the buffer size of 4096 as universal truth; look at the results I get for different sizes (buffer size in bytes vs. wall-clock time in seconds):
```
buffer size (bytes)   wall time (s)
   2                  2.726
   4                  1.948
   8                  1.693
  16                  1.534
  32                  1.525
  64                  1.398
 128                  1.432
 256                  1.377
 512                  1.347
1024                  1.442
2048                  1.316
4096                  1.318
```
As you can see, the gains appear early and level off quickly (and my timings are most likely quite inaccurate); buffer size is a trade-off between performance and memory. The default value of 4096 is a sensible choice but, as always, measure first.
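To repeat the experiment on your own machine, a small timing harness is enough. The sketch below (function names are hypothetical, not from the answer) times chunked reads for a list of candidate buffer sizes using `time.perf_counter`:

```python
import time


def count_bytes(filename, buffer_size):
    """Read `filename` in `buffer_size` chunks and return the total
    number of bytes read."""
    total = 0
    with open(filename, "rb") as fp:
        chunk = fp.read(buffer_size)
        while chunk:
            total += len(chunk)
            chunk = fp.read(buffer_size)
    return total


def benchmark(filename, sizes):
    """Return a {buffer_size: elapsed_seconds} mapping, one entry per
    candidate buffer size."""
    timings = {}
    for size in sizes:
        start = time.perf_counter()
        count_bytes(filename, size)
        timings[size] = time.perf_counter() - start
    return timings
```

Note that a single run per size, as above, is noisy; averaging several runs (or using `timeit`) gives steadier numbers.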
Escualo Oct 06 '14 at 2:20