I parsed some UTF-8 encoded docx files containing special characters (Czech alphabet). Printing to stdout works fine, but when I try to write the data to a file I get this error:
Traceback (most recent call last):
  File "./test.py", line 360, in <module>
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)
I explicitly converted the word variable to unicode (type(word) returns unicode), and I also tried encoding it with .encode('utf-8'), but I still get this error.
Here is a sample of what the code looks like now:
for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...
I also tried the following:
for word in word_list:
    word = word.encode('utf-8')
Even a combination of these two:
word = unicode(word)
word = word.encode('utf-8')
I was desperate, so I even tried to encode the word variable inside ofile.write():
ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')
I would appreciate any hints about what I am doing wrong.
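
For reference, here is a minimal sketch of one way the write might be made to work. It rests on an assumption the post does not confirm: that ofile was opened with the built-in open(), in which case Python 2 implicitly encodes unicode text as ASCII on write and fails on characters such as u'\xed'. The sketch opens the file with io.open and an explicit encoding instead, so the unicode strings are passed through unencoded. The names write_feats, read_back, and feats.xml and the sample words are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample data standing in for the words parsed from the docx files.
word_list = [u'příliš', u'žluťoučký', u'kůň']


def write_feats(path, words):
    """Write one <feat> line per word; the file object encodes to UTF-8."""
    # io.open with encoding= accepts unicode text and encodes it on write,
    # so no manual .encode('utf-8') call is needed on each word.
    with io.open(path, 'w', encoding='utf-8') as ofile:
        for word in words:
            ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="' + word + u'"/>\n')


def read_back(path):
    """Read the file back as unicode text to check what was written."""
    with io.open(path, 'r', encoding='utf-8') as ifile:
        return ifile.read()


# Write to a throwaway temporary directory (hypothetical file name).
out_path = os.path.join(tempfile.mkdtemp(), 'feats.xml')
write_feats(out_path, word_list)
content = read_back(out_path)
```

With this approach every string handed to ofile.write() stays unicode. Mixing unicode literals with already-encoded byte strings, as in the last attempt above, forces Python 2 to convert between the two implicitly, and that conversion also uses ASCII.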