Performing file I/O in Python with non-ASCII characters

Question

I am working on a Python script that reads an XML file encoded using UTF-8, performs some manipulations with it, and saves it to Google Datastore (this is an App Engine program).

I read and parse the file with just file.readline() and a few regexes. The only problem is that the file I'm working with contains characters from several languages: for example, it can have é or Å, or Russian or Greek letters.

At first I got this error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)". Then I tried opening the file with the encoding switched to "ISO-8859-15", which got rid of the error, but the characters were displayed incorrectly.

So my question is: how do I work with a UTF-8 encoded file in Python without Python choking on all the special characters in the file? I hope this was clear enough, and thanks in advance for any advice.

+4

python google-app-engine file-io localization

dshipper Jan 20 '11 at 21:26

3 answers

You say that you changed the encoding you use with the file to ISO-8859-1. Have you tried changing it to UTF-8?
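As a rough sketch of what "changing it to UTF-8" could look like: io.open (available in Python 2.6+ and Python 3) takes an encoding argument and decodes while reading. The sample text and the temporary file below are made up for illustration and only stand in for the XML file from the question.

```python
import io
import os
import tempfile

# Hypothetical sample line standing in for the question's XML file.
sample = u"<name>\u00e9 \u00c5 \u041f\u0440\u0438\u0432\u0435\u0442</name>\n"

# Write the sample out as UTF-8 so there is something to read back.
fd, path = tempfile.mkstemp(suffix=".xml")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# With encoding="utf-8", readline() returns decoded text, not raw bytes.
with io.open(path, "r", encoding="utf-8") as f:
    line = f.readline()

os.remove(path)
```

Here line holds the same text that was written, so regexes can be applied to it without any further decode step.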

+1

Nick Johnson Jan 20 '11 at 10:43

To expand on the answer above, and with reference to effbot, you can decode each line as you read it:

    raw = file.readline()
    proc = raw.decode('utf-8')
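A slightly fuller sketch of the same idea, combined with a regex as in the question. The byte string, the <city> element, and the pattern are invented for illustration; io.BytesIO stands in for a file opened in binary mode. Decoding line by line is safe for UTF-8 because the newline byte never occurs inside a multi-byte sequence.

```python
import io
import re

# Hypothetical UTF-8 bytes standing in for the file's contents.
raw_file = io.BytesIO(
    b"<city>\xd0\x9f\xd1\x81\xd0\xba\xd0\xbe\xd0\xb2</city>\n"
    b"<city>\xc3\x85lesund</city>\n"
)

# Hypothetical pattern; it is applied to decoded text, not to raw bytes.
pattern = re.compile(u"<city>(.*?)</city>")

cities = []
for raw in raw_file:
    proc = raw.decode("utf-8")  # bytes -> text, one line at a time
    match = pattern.search(proc)
    if match:
        cities.append(match.group(1))
```

When the text has to be written back out as bytes, the reverse step is proc.encode('utf-8').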

0

William Jan 20 '11 at 10:41

Brian Goldman · Accepted Answer · 2011-01-20T21:30:38+0000

Specify the UTF-8 encoding when calling str.decode:

    >>> print '\xe2\x99\x9e'.decode('utf-8')
    ♞

It should be a chess piece, but it's too tiny to see :)
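The same mechanism probably explains what the question describes: ISO-8859-15 assigns a character to every byte, so decoding never fails, but each UTF-8 multi-byte sequence comes out as several wrong characters. A small sketch using the bytes for "é" (chosen here only as an example):

```python
raw = b"\xc3\xa9"  # the UTF-8 encoding of "é" (U+00E9)

right = raw.decode("utf-8")        # one character: é
wrong = raw.decode("iso-8859-15")  # two characters: Ã followed by ©

# The ASCII codec rejects any byte >= 0x80, which is the original error.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

So the lack of an error with ISO-8859-15 does not mean the text was decoded correctly; only the codec that matches the file's real encoding gives back the original characters.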
