Fastest way to convert file from latin1 to utf-8 in python

Question

I need a fast way to convert files from latin1 to utf-8 in Python. The files are large, about 2 GB each (I am moving database data). So far I have:

    import codecs

    infile = codecs.open(tmpfile, 'r', encoding='latin1')
    outfile = codecs.open(tmpfile1, 'w', encoding='utf-8')
    for line in infile:
        outfile.write(line)
    infile.close()
    outfile.close()

but it is still slow. The conversion takes one quarter of the total migration time.

I could also use a Linux command-line utility if it is faster than native Python code.

+4

python

Mike starov Mar 08 '10 at 21:22

3 answers

I would go with iconv and a system call.
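As a rough illustration of that suggestion, here is a minimal sketch of calling iconv from Python. It assumes Python 3 and an `iconv` binary on the PATH; the function name `convert_with_iconv` is made up for this example. Only the `-f` and `-t` options are used, and the output goes through stdout rather than relying on any implementation-specific output flag.

```python
import subprocess


def convert_with_iconv(src_path, dst_path):
    """Convert src_path (Latin-1) to dst_path (UTF-8) by shelling out to iconv.

    Hypothetical helper: assumes an `iconv` executable is installed.
    """
    with open(dst_path, "wb") as out:
        # -f names the source encoding, -t the target encoding.
        # iconv writes the converted bytes to stdout, which we redirect to the file.
        subprocess.run(
            ["iconv", "-f", "ISO-8859-1", "-t", "UTF-8", src_path],
            stdout=out,
            check=True,  # raise CalledProcessError if iconv exits with an error
        )
```

Whether this actually beats a block-wise Python loop depends on the machine, so it is worth timing both on a real file.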

+6

user180100 Mar 08 '10 at 21:24

If you are desperate to do this in Python (or any other language), at least do the I/O in larger chunks than lines, and avoid the overhead of codecs.

    infile = open(tmpfile, 'rb')
    outfile = open(tmpfile1, 'wb')
    BLOCKSIZE = 65536  # experiment with size
    while True:
        block = infile.read(BLOCKSIZE)
        if not block:
            break
        outfile.write(block.decode('latin1').encode('utf8'))
    infile.close()
    outfile.close()

Otherwise, go with iconv ... I have not looked under the hood, but if it does not special-case latin1 input, I would be surprised :-)

+2

John Machin Mar 08 '10 at 10:06

Alex Martelli · Accepted Answer · 2010-03-08T22:02:30+0000

You can use blocks larger than one line and do binary I/O. Each of these may speed things up a bit (although on Linux binary I/O will not, since it is identical to text I/O):

    BLOCKSIZE = 1024*1024
    with open(tmpfile, 'rb') as inf:
        # write to a different file (tmpfile1); opening the input path with 'wb' would truncate it
        with open(tmpfile1, 'wb') as ouf:
            while True:
                data = inf.read(BLOCKSIZE)
                if not data:
                    break
                converted = data.decode('latin1').encode('utf-8')
                ouf.write(converted)

The character-by-character parsing implied by reading line by line, the line-end conversion (not on Linux ;-), and the codecs.open-style encoding and decoding are probably part of what is slowing you down. This approach is also portable (as yours is), because control characters such as \n need no translation between these codecs anyway (on any OS).

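This only works for input codecs that have no multibyte characters, and "latin1" is one of those (it does not matter whether the output codec has such characters or not). With a multibyte input encoding, a block boundary could fall in the middle of a character, and decoding that block on its own would fail.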
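If the input encoding does have multibyte characters, one possible workaround is the standard library's incremental decoder, which holds back a trailing partial character until the next block arrives. The following is only a sketch, assuming Python 3; the function name `convert_incremental` and its parameters are invented for this example and are not part of the answer's code.

```python
import codecs


def convert_incremental(src_path, dst_path, src_encoding, blocksize=1024 * 1024):
    """Block-wise conversion to UTF-8 that tolerates multibyte input encodings.

    Hypothetical helper, shown only to illustrate the block-boundary issue.
    """
    decoder = codecs.getincrementaldecoder(src_encoding)()
    with open(src_path, "rb") as inf, open(dst_path, "wb") as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break
            # The decoder buffers any incomplete trailing character
            # and completes it when the next block is fed in.
            ouf.write(decoder.decode(data).encode("utf-8"))
        # final=True raises an error if the file ended in the middle of a character.
        ouf.write(decoder.decode(b"", final=True).encode("utf-8"))
```

For latin1 input this buys nothing, since every byte is a complete character, and the extra decoder layer may cost some speed.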

Try different block sizes to find the sweet spot for your disk, filesystem and available RAM.
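One way to run that experiment is a small timing loop like the sketch below. It assumes Python 3; the function name `time_block_sizes` and the candidate sizes are arbitrary choices for illustration, not recommendations.

```python
import time


def time_block_sizes(src_path, dst_path,
                     sizes=(64 * 1024, 1024 * 1024, 16 * 1024 * 1024)):
    """Return {blocksize: seconds} for a latin1 -> utf-8 conversion of src_path.

    Hypothetical helper; dst_path is overwritten on every run.
    """
    timings = {}
    for blocksize in sizes:
        start = time.perf_counter()
        with open(src_path, "rb") as inf, open(dst_path, "wb") as ouf:
            while True:
                data = inf.read(blocksize)
                if not data:
                    break
                ouf.write(data.decode("latin1").encode("utf-8"))
        timings[blocksize] = time.perf_counter() - start
    return timings
```

Keep in mind that the operating system's file cache can make later runs look faster than the first one, so repeating the measurement or using a file larger than RAM gives a fairer comparison.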

Edit: changed the code per @John's comment and clarified the condition per @gnibbler's.
