Python 2.7 UnicodeDecodeError: 'ascii' codec cannot decode bytes

Question

I parsed some docx files (UTF-8 encoded) containing special characters (Czech alphabet). Printing the data to stdout works fine, but writing it to a file fails:

Traceback (most recent call last):
  File "./test.py", line 360, in <module>
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)

I explicitly converted the word variable to unicode (type(word) returns unicode) and I also tried encoding it with .encode('utf-8'), but I am still stuck with this error.

Here is a sample of the code as it looks now:

 for word in word_list:
     word = unicode(word)
     #...
     ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
     #...

I also tried the following:

 for word in word_list:
     word = word.encode('utf-8')
     #...
     ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
     #...

Even a combination of these two:

 word = unicode(word)
 word = word.encode('utf-8')

I was desperate, so I even tried to encode the word variable inside ofile.write():

 ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')

I would appreciate any hints as to what I am doing wrong.

+4

python unicode

rivfaader Nov 22 '12 at 12:07

4 answers

phihag's answer is correct. I just want to suggest converting the unicode string to a byte string manually, with an explicit encoding:

 ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n').encode('utf-8'))

(Perhaps you would like to know how to do this with basic mechanisms rather than with advanced wizardry and black magic such as io.open.)
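A minimal sketch of this manual approach applied to a whole loop, assuming every word is already a unicode string. The helper name feat_line, the sample words, and the temporary file are illustrative only and do not come from the original code:

```python
# -*- coding: utf-8 -*-
import os
import tempfile


def feat_line(word):
    """Build the XML line as unicode, then encode it once to UTF-8 bytes."""
    line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'
    return line.encode('utf-8')


# Hypothetical sample data standing in for the words parsed from the docx files.
word_list = [u'p\xedsmeno', u'\u010de\u0161tina']

path = os.path.join(tempfile.mkdtemp(), 'out.xml')

# Binary mode: the file receives exactly the bytes produced by encode().
with open(path, 'wb') as ofile:
    for word in word_list:
        ofile.write(feat_line(word))

# Read the raw bytes back to check what was written.
with open(path, 'rb') as check:
    data = check.read()
```

The point of the design is that all concatenation happens on unicode strings, and the conversion to bytes happens exactly once, on the complete line.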

+2

Alfe Nov 22 '12 at 12:32

I had a similar error when writing to text documents (.docx), in particular with the euro symbol (€).

 x = "€".encode()

Which gave an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

How I solved it, by decoding the byte string to unicode with an explicit encoding:

 x = "€".decode('utf-8')

Hope this helps!

+2

John paul hayes Nov 30 '14 at 20:49

The best solution I found on Stack Overflow is in this post: How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte". Put the following at the beginning of your code, and the default encoding will be utf8:

 # encoding=utf8
 import sys
 reload(sys)
 sys.setdefaultencoding('utf8')

0

Jose R. Zapata Nov 14 '16 at 12:59

phihag · Accepted Answer · 2012-11-22T12:13:41+0000

ofile is a byte stream, and you are writing a character (unicode) string to it. It therefore tries to handle your mistake by implicitly encoding the string to a byte string. That is only safe with ASCII characters. Since word contains non-ASCII characters, it fails:

 >>> open('/dev/null', 'wb').write(u'ä')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
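The implicit conversion can also be reproduced by hand. The sketch below spells out the ASCII step that Python 2 performs behind the scenes. The sample word is an assumption, chosen only because it contains the u'\xed' character from the question's traceback:

```python
# -*- coding: utf-8 -*-
word = u'p\xedsmeno'  # hypothetical sample containing u'\xed'

# Writing unicode to a byte stream behaves like an ASCII encode.
try:
    word.encode('ascii')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Joining UTF-8 bytes with a unicode literal behaves like an ASCII decode
# of those bytes, which is where UnicodeDecodeError comes from.
try:
    word.encode('utf-8').decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(encode_error).__name__)
print(type(decode_error).__name__)
```

This is why calling word.encode('utf-8') before the concatenation does not help in Python 2: the resulting byte string is decoded as ASCII again as soon as it is joined with the u'...' literals.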

Make ofile a text stream by opening the file with io.open, using a mode such as 'wt' and an explicit encoding:

 >>> import io
 >>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
 1L

Alternatively, you can use codecs.open, which has almost the same interface, or encode all strings manually with encode.
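As a rough sketch of both file-based options applied to the loop from the question, assuming word_list holds unicode strings (the sample words and the temporary paths are made up for illustration):

```python
# -*- coding: utf-8 -*-
import codecs
import io
import os
import tempfile

word_list = [u'p\xedsmeno', u'\u010de\u0161tina']  # hypothetical sample data
tmpdir = tempfile.mkdtemp()
io_path = os.path.join(tmpdir, 'io_open.xml')
codecs_path = os.path.join(tmpdir, 'codecs_open.xml')

# io.open in text mode returns a stream that encodes unicode on write.
with io.open(io_path, 'wt', encoding='utf-8') as ofile:
    for word in word_list:
        ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')

# codecs.open accepts the same encoding argument and also takes unicode.
with codecs.open(codecs_path, 'w', encoding='utf-8') as ofile:
    for word in word_list:
        ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')

# Read the raw bytes back to compare what each variant wrote.
with io.open(io_path, 'rb') as f:
    io_bytes = f.read()
with io.open(codecs_path, 'rb') as f:
    codecs_bytes = f.read()
```

In both variants the strings passed to write() stay unicode throughout, so no implicit ASCII conversion takes place. One difference to keep in mind is that io.open in text mode may translate newlines on some platforms, whereas codecs.open opens the underlying file in binary mode.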
