Performing file I/O in Python with non-ASCII characters

I am working on a Python script that reads a UTF-8 encoded XML file, manipulates its contents, and saves the result to the Google Datastore (this is an App Engine program).

I read and parse the file with just file.readline() and a few regexes. The only problem is that the file I'm working with contains characters from different languages: for example, it can have é or Å, or Russian or Greek letters.

At first I got this error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)". Then I tried opening the file with the encoding set to "ISO-8859-15", which gets rid of the error, but the characters are not displayed correctly.

So my question is: how do I work with a UTF-8 encoded file in Python without Python choking on all the special characters in the file? I hope this was clear enough, and thanks in advance for any advice.

+4
3 answers

Specify the UTF-8 encoding when calling str.decode:

 >>> print '\xe2\x99\x9e'.decode('utf-8')
 ♞

It should be a chess piece, but it's too tiny to see :)
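As a rough sketch of the same idea in script form, written so that it should run on both Python 2 and Python 3: the bytes are decoded once into text, and encoded again only when they need to go back out. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: decode UTF-8 bytes to text, then encode the text back.

raw = b'\xe2\x99\x9e'               # three UTF-8 bytes, as they would come from a file
text = raw.decode('utf-8')          # a single Unicode character, U+265E
round_trip = text.encode('utf-8')   # back to the original three bytes
```

The point is that the three bytes and the one character are different things; the 'ascii' codec error appears when Python is left to convert between them on its own.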

+4

You say you changed the encoding you use for the file to ISO-8859-15. Have you tried changing it to UTF-8?
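A minimal sketch of what that could look like, assuming io.open is available (Python 2.6+ and Python 3). The sample data and the temporary file are hypothetical stand-ins for the real XML file. It also shows why ISO-8859-15 silences the error without giving the right characters: every byte is valid in that encoding, so each multi-byte UTF-8 character is quietly split into two wrong ones.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample line containing é, Å, a Cyrillic and a Greek letter.
sample = u'<name>\u00e9 \u00c5 \u0414 \u03a9</name>\n'

# Write a small UTF-8 file to stand in for the real XML file.
fd, path = tempfile.mkstemp(suffix='.xml')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# Opening with the correct encoding returns text identical to what was written.
with io.open(path, 'r', encoding='utf-8') as f:
    as_utf8 = f.readline()

# Opening with ISO-8859-15 raises no error, but the non-ASCII characters
# come back as pairs of unrelated characters.
with io.open(path, 'r', encoding='iso-8859-15') as f:
    as_latin9 = f.readline()

os.remove(path)
```

With the file opened this way, readline() already returns decoded text, so no separate decode step is needed.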

+1

To expand on the answer above, and with reference to effbot, you can decode each line as follows:

 raw = file.readline()       # raw bytes as stored in the file
 proc = raw.decode('utf-8')  # decoded to a unicode string
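Put together with the readline-and-regex approach from the question, a loop might look roughly like this. It is only a sketch: the in-memory stream, the <city> tag and the pattern are made-up placeholders for the real file and regexes.

```python
# -*- coding: utf-8 -*-
import io
import re

# Hypothetical stand-in for the real file: UTF-8 bytes held in memory.
data = (u'<city>Malm\u00f6</city>\n'
        u'<city>\u0410\u0444\u0438\u043d\u044b</city>\n').encode('utf-8')
stream = io.BytesIO(data)

# A unicode pattern, so it is matched against decoded text, not raw bytes.
pattern = re.compile(u'<city>(.*?)</city>')

cities = []
while True:
    raw = stream.readline()        # bytes, exactly as stored
    if not raw:                    # an empty result means end of file
        break
    line = raw.decode('utf-8')     # unicode text from here on
    match = pattern.search(line)
    if match:
        cities.append(match.group(1))
```

Decoding line by line is safe for UTF-8 because the newline byte never occurs inside a multi-byte character, so a line break cannot cut a character in half.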
0

Source: https://habr.com/ru/post/1336385/