You can use blocks larger than one line and do binary I/O — each of these may speed things up a bit (though on Linux binary I/O won't, since it is identical to text I/O there):
    BLOCKSIZE = 1024*1024
    with open(tmpfile, 'rb') as inf:
        # Write to a separate file (name here is illustrative): opening
        # tmpfile itself with 'wb' would truncate the input before it is read.
        with open(tmpfile + '.converted', 'wb') as ouf:
            while True:
                data = inf.read(BLOCKSIZE)
                if not data:
                    break
                converted = data.decode('latin1').encode('utf-8')
                ouf.write(converted)
The line-by-line parsing implied by reading one line at a time, the end-of-line conversion (not on Linux ;-), and the decoding/encoding overhead of codecs.open are probably part of what is slowing you down. This approach is also portable (just like yours), because control characters such as \n need no translation between these codecs anyway (on any OS).
This only works for input codecs that have no multibyte characters, and "latin1" is one of those (it does not matter whether the output codec has multibyte characters or not).
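To see why the input codec must be single-byte: a fixed-size read() can split a multibyte sequence across two blocks, and decoding each block on its own then fails. A small sketch of the failure mode (the string and split point are just for illustration):

```python
# 'é' encodes to two bytes in UTF-8, so a block boundary can land
# in the middle of the character.
data = "héllo".encode("utf-8")          # b'h\xc3\xa9llo'
first, second = data[:2], data[2:]      # split between the bytes of 'é'

# Decoding the blocks independently as UTF-8 breaks:
try:
    first.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 split mid-character fails:", exc)

# latin1 maps every byte to exactly one character, so any split is safe
# and the per-block decodes concatenate to the right result:
assert first.decode("latin1") + second.decode("latin1") == data.decode("latin1")
print("latin1 decodes each block independently")
```

If the input really were a multibyte encoding, you would need an incremental decoder (e.g. the codecs module's incremental decoders) rather than per-block decode() calls.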
Try different block sizes to find the sweet spot for your disk, filesystem, and available RAM.
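A minimal timing sketch for that tuning, assuming the same latin1-to-utf-8 conversion as above (the `convert` helper, dummy file, and block sizes here are illustrative, not part of the original answer):

```python
import os
import tempfile
import time

def convert(src, dst, blocksize):
    """Re-encode src (assumed latin1) into dst as utf-8, one block at a time."""
    with open(src, 'rb') as inf, open(dst, 'wb') as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break
            ouf.write(data.decode('latin1').encode('utf-8'))

# Throwaway benchmark on 8 MiB of dummy data; real numbers depend on
# your disk, filesystem and RAM, so measure with your own files.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'x' * (8 * 1024 * 1024))
    src = f.name
dst = src + '.out'

for blocksize in (64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    t0 = time.perf_counter()
    convert(src, dst, blocksize)
    print(f"{blocksize // 1024:>5} KiB blocks: {time.perf_counter() - t0:.3f}s")

os.remove(src)
os.remove(dst)
```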
Edit: changed the code per @John's comment and clarified the loop condition per @gnibbler's.