Python 2.7 UnicodeDecodeError: 'ascii' codec cannot decode bytes

I parsed some docx files (UTF-8 encoded) containing special characters (Czech alphabet). Printing to stdout works fine, but I cannot write the data to a file:

Traceback (most recent call last):
  File "./test.py", line 360, in <module>
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)

I explicitly converted the word variable to the unicode type (type(word) returns unicode), and I also tried encoding it with .encode('utf-8'). I still get this error.

Here is a sample of what the code looks like now:

 for word in word_list:
     word = unicode(word)
     #...
     ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
     #...

I also tried the following:

 for word in word_list:
     word = word.encode('utf-8')
     #...
     ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
     #...

I even tried a combination of the two:

 word = unicode(word)
 word = word.encode('utf-8')

I was desperate, so I even tried encoding the word variable inside ofile.write():

 ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n') 

I would appreciate any hints as to what I am doing wrong.

+4
4 answers

ofile is a byte stream, and you are writing a character (unicode) string to it. Python 2 therefore tries to handle your mistake by implicitly encoding the string into a byte string with the ASCII codec. That is only safe for ASCII characters. Since word contains non-ASCII characters, it fails:

 >>> open('/dev/null', 'wb').write(u'ä')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
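To see that the failure comes from the encoding step and not from the file itself, the implicit conversion can be reproduced by hand. This is only a minimal sketch: the sample word is a hypothetical stand-in for an entry of word_list, and the explicit .encode('ascii') call is assumed to mirror what the byte stream attempts in Python 2.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce by hand the ASCII encoding that a byte stream attempts
# implicitly in Python 2 when it is handed a unicode string.
word = u'p\u0159\xedklad'  # hypothetical sample word ("příklad")

try:
    word.encode('ascii')          # fails on the first non-ASCII character
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    bad_char = word[exc.start]    # the character ASCII cannot represent

# Naming a codec that covers the characters makes the same step succeed.
encoded = word.encode('utf-8')
```

The exception reports the position of the offending character, just as the traceback in the question reports position 37.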

Make ofile a text stream by opening the file with io.open, using a mode such as 'wt' and an explicit encoding:

 >>> import io
 >>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
 1L

Alternatively, you can use codecs.open, which has almost the same interface, or encode all strings manually with encode.
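Applied to the line from the question, both variants might look roughly like the sketch below. The sample word, the temporary directory and the file names are assumptions made for illustration; newline='\n' is passed to io.open only so that both files come out byte-for-byte identical on every platform.

```python
import codecs
import io
import os
import tempfile

word = u'p\u0159\xedklad'  # hypothetical sample word ("příklad")
line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'

tmp_dir = tempfile.mkdtemp()                       # hypothetical output location
path_io = os.path.join(tmp_dir, 'io_out.xml')
path_codecs = os.path.join(tmp_dir, 'codecs_out.xml')

# io.open returns a text stream that encodes unicode strings on write.
with io.open(path_io, 'wt', encoding='utf-8', newline='\n') as ofile:
    ofile.write(line)

# codecs.open is used the same way here.
with codecs.open(path_codecs, 'w', encoding='utf-8') as ofile:
    ofile.write(line)

# Read the raw bytes back to compare what actually landed on disk.
with open(path_io, 'rb') as f:
    bytes_io = f.read()
with open(path_codecs, 'rb') as f:
    bytes_codecs = f.read()
```

In both cases the code passes unicode strings to write() and leaves the conversion to bytes to the stream.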

+11

Phihag's answer is correct. I just want to suggest converting the unicode string to a byte string manually with an explicit encoding:

 ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n').encode('utf-8')) 

(Perhaps you would like to know how to do this with basic mechanisms instead of advanced wizardry and black magic such as io.open.)
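As a rough sketch of the manual approach in a loop: build each line entirely from unicode pieces, encode the whole line once, and write the resulting bytes to a file opened in binary mode. The word list and the temporary file are hypothetical stand-ins for the data and output file in the question.

```python
import os
import tempfile

# Hypothetical sample data standing in for the parsed docx words.
word_list = [u'p\u0159\xedklad', u'\u010de\u0161tina']

fd, path = tempfile.mkstemp(suffix='.xml')   # hypothetical output file
os.close(fd)

with open(path, 'wb') as ofile:              # byte stream: write bytes only
    for word in word_list:
        # Keep everything unicode until the very last step...
        line = u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n'
        # ...then encode the complete line exactly once.
        ofile.write(line.encode('utf-8'))

# Read the file back and decode it to check the round trip.
with open(path, 'rb') as f:
    decoded = f.read().decode('utf-8')
os.remove(path)
```

Encoding the complete line avoids mixing byte strings and unicode strings in one concatenation, which is what the attempts in the question ran into.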

+2

I had a similar error when writing to Word documents (.docx), in particular with the euro symbol (€).

 x = "€".encode() 

Which gave an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

How I solved it:

 x = "€".decode('utf-8')  # byte string -> unicode; a bare .decode() would still use the ASCII codec

Hope this helps!

+2

The best solution I found on Stack Overflow is in this post: How to fix: “UnicodeDecodeError: 'ascii' codec can't decode byte”. Put the following at the beginning of your code, and the default encoding will be UTF-8:

 # encoding=utf8
 import sys
 reload(sys)
 sys.setdefaultencoding('utf8')

0

Source: https://habr.com/ru/post/1447731/