Python xml.sax parsing problem with accented characters

The following code causes the well-known "UnicodeDecodeError codec:" ascii "cannot decode" error:

import xml.sax import io parser = xml.sax.make_parser() parser.parse(io.StringIO(u'<a>é</a>')) 

While

 import xml.sax parser = xml.sax.make_parser() parser.parse(open('foo')) 

works (the contents of the file "foo" is <a>é</a> ).

I need to parse an XML string in my case, not a file.

Is there any solution to my problem? Thanks.

+4
source share
1 answer

The file contains bytes and must have some encoding for storing Unicode characters, so instead of BytesIO :

 #coding: utf8 import xml.sax import io parser = xml.sax.make_parser() parser.parse(io.BytesIO(u'<a>é</a>'.encode('utf8'))) 

Note: #coding: utf8 indicates the encoding of the source file; .encode('utf8') specifies the encoding of the Unicode string that will be stored in the BytesIO object. Technically, using a string other than Unicode:

 #coding: utf8 parser.parse(io.BytesIO('<a>é</a>')) 

will also work, since byte strings will already be in the encoding of the source file, but this makes the goal more clear. The source file and BytesIO may be different.

+2
source

Source: https://habr.com/ru/post/1382938/


All Articles