Python utf-8 encoding throws UnicodeDecodeError despite "errors = 'replace'"

Question

I am trying to write text and encode it as utf-8, where possible, using the following code:

outf.write((lang_name + "," + (script_name or "") + "\n").encode("utf-8", errors='replace'))

I get the following error:

File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode 
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>

I thought the errors='replace' part of my encode call would handle this?

fwiw, I just open the file with

outf = open(outfile, 'w')

without explicit declaration of the encoding.

print repr(outf)

gives:

<open file 'myfile.csv', mode 'w' at 0x000000000315E930>

I split the write statement into a separate concatenation, encoding, and file write:

outstr = lang_name + "," + (script_name or "") + "\n"
encoded_outstr = outstr.encode("utf-8", errors='replace')
outf.write(encoded_outstr)

It is the concatenation that throws the exception.

The two strings, as shown by print repr(foo):

lang_name: 'G\xc4\x81ndh\xc4\x81r\xc4\xab'
script_name: u'Kharo\u1e63\u1e6dh\u012b'

Further detective work shows that I can concatenate either one of them with a plain ASCII string without any difficulty; it is putting both of them into the same line that breaks things.

python encoding utf-8 cp1252

Purple vermont 08 . '15 17:38

RemcoGerlich · Answer 1 · 2015-07-08T19:32:12+0000

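The problem is that one of your strings, 'G\xc4\x81ndh\xc4\x81r\xc4\xab', is a bytestring, while the other is a Unicode string, u'Kharo\u1e63\u1e6dh\u012b'.

When you concatenate them, Python 2.7 first has to decode the bytestring so that the result can be Unicode. It does that with the default codec, which is cp1252 in your traceback and ASCII on a stock install, and the decode fails because the bytes are neither cp1252 nor ASCII but UTF8.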
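The implicit decode that Python 2 performs during such a concatenation can be reproduced by hand. The following is only a sketch: try_decode is a hypothetical helper, and the three codecs are tried explicitly to show which ones reject the bytes from the question.

```python
# Sketch only: performs explicitly the decode step that Python 2 does
# implicitly when a byte string is concatenated with a unicode string.
lang_name = b'G\xc4\x81ndh\xc4\x81r\xc4\xab'  # UTF-8 bytes, from the question


def try_decode(raw, codec):
    """Hypothetical helper: return the decoded text, or the error raised."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError as exc:
        return exc


as_cp1252 = try_decode(lang_name, "cp1252")  # byte 0x81 has no cp1252 mapping
as_ascii = try_decode(lang_name, "ascii")    # bytes >= 0x80 are not ASCII
as_utf8 = try_decode(lang_name, "utf-8")     # the codec the bytes are really in

print(repr(as_cp1252))
print(repr(as_ascii))
print(repr(as_utf8))
```

Only the utf-8 decode succeeds. The other two fail before any encode call could run, which is why errors='replace' on encode has no effect.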

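So either decode the UTF8 bytestring to Unicode, or encode the Unicode string to UTF8 so that everything is bytes, like this: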

encoded_outstr = lang_name + b"," + (script_name or u"").encode('utf-8') + b"\n"

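Note that I wrote b"," so that the literal is explicitly bytes and not Unicode; with from __future__ import unicode_literals, or on Python 3, a plain "," is Unicode, and mixing it with the bytestring fails again.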
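The all-bytes approach can be wrapped in a small helper so that None and already-encoded values are handled in one place. This is a minimal sketch: to_utf8_bytes is a hypothetical name, and it assumes that any bytestring passed in is already UTF8.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 bytes; None becomes an empty byte string.

    Hypothetical helper: byte strings are assumed to be UTF-8 already
    and are passed through unchanged.
    """
    if value is None:
        return b""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


lang_name = b'G\xc4\x81ndh\xc4\x81r\xc4\xab'  # bytes, as in the question
script_name = u'Kharo\u1e63\u1e6dh\u012b'     # unicode, as in the question

# Every operand is bytes, so no implicit decode is ever attempted.
encoded_outstr = to_utf8_bytes(lang_name) + b"," + to_utf8_bytes(script_name) + b"\n"
no_script = to_utf8_bytes(lang_name) + b"," + to_utf8_bytes(None) + b"\n"
```

The result is already encoded, so it can go straight to outf.write(); on Python 3 the file would have to be opened in binary mode ('wb') for that.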

Mark Ransom · Answer 2 · 2015-07-08T19:32:58+0000

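When you concatenate a byte string with a Unicode string, Python 2 coerces the result to Unicode. To do that it decodes the byte string with the default codec, normally ASCII, so any bytes in the range \x80 to \xff make it fail. Note that the error says can't decode, not can't encode: that is the clue that it failed before it ever reached your encode call.

The fix is to decode the byte string to Unicode yourself, with the correct codec, so that everything you concatenate is already Unicode.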

outstr = lang_name.decode("utf-8") + u"," + (script_name or u"") + u"\n"
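A slightly fuller sketch of the all-Unicode approach follows. The to_text helper and the temporary file path are hypothetical, and using io.open with an explicit encoding is an assumption about how the file could be written, not something taken from the question.

```python
import io
import os
import tempfile


def to_text(value, encoding="utf-8"):
    """Return value as unicode text; None becomes an empty string.

    Hypothetical helper: the utf-8 default is an assumption about the input.
    """
    if value is None:
        return u""
    if isinstance(value, bytes):
        # errors="replace" belongs here, on the decode, if undecodable bytes
        # should become U+FFFD instead of raising UnicodeDecodeError.
        return value.decode(encoding, errors="replace")
    return value


lang_name = b'G\xc4\x81ndh\xc4\x81r\xc4\xab'  # bytes, as in the question
script_name = u'Kharo\u1e63\u1e6dh\u012b'     # unicode, as in the question
outstr = to_text(lang_name) + u"," + to_text(script_name) + u"\n"

# io.open encodes on write, so no explicit .encode() call is needed.
path = os.path.join(tempfile.mkdtemp(), "myfile.csv")
with io.open(path, "w", encoding="utf-8", newline="") as outf:
    outf.write(outstr)

# Read the raw bytes back to see what was actually written.
with io.open(path, "rb") as check:
    written = check.read()
```

Decoding at the boundary and encoding only when writing keeps every intermediate value Unicode, so Python never has to guess a codec.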
