Python UnicodeDecodeError when writing German letters

Question

I have been banging my head against this error for some time, and I cannot find a solution anywhere on SO, although there are similar questions.

Here is my code:

 f = codecs.open(path, "a", encoding="utf-8")
 value = "Bitte überprüfen"
 f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

And this is the error I get:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

Why ascii if I say utf-8? I would really appreciate any help.

python encoding

Banana Mar 04 '15 at 10:39

4 answers

JuniorCompressor · Answer 1 · 2015-03-04T10:41:49+0000

Try:

 value = u"Bitte überprüfen"

to declare the value as a unicode string and

 # -*- coding: utf-8 -*-

at the beginning of your file to declare that your python file is saved using utf-8 encoding.

zmo · Answer 2 · 2015-03-04T10:47:01+0000

To never get hurt by Unicode errors again, switch to python3:

 % python3
 >>> with open('/tmp/foo', 'w') as f:
 ...     value = "Bitte überprüfen"
 ...     f.write(('"{}" = "{}";\n'.format('no_internet', value)))
 ...
 36
 >>> import sys
 >>> sys.exit(0)
 % cat /tmp/foo
 "no_internet" = "Bitte überprüfen";

Although, if you are really tied to python2 and have no choice:

 % python2
 >>> with open('/tmp/foo2', 'w') as f:
 ...     value = u"Bitte überprüfen"
 ...     f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
 ...
 >>> import sys
 >>> sys.exit(0)
 % cat /tmp/foo2
 "no_internet" = "Bitte überprüfen";

And as @JuniorCompressor suggests, don't forget to add # encoding: utf-8 at the beginning of your python2 file to tell python to read the source file as UTF-8, not ASCII!

Your mistake:

 f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

is that you encode the whole formatted string to utf-8, whereas you should encode the value string to utf-8 before doing the format. Formatting the unicode value first and encoding afterwards still fails:

 >>> with open('/tmp/foo2', 'w') as f:
 ...     value = u"Bitte überprüfen"
 ...     f.write(('"{}" = "{}";\n'.format('no_internet', value).encode('utf-8')))
 ...
 Traceback (most recent call last):
   File "<stdin>", line 3, in <module>
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)

This is because python first needs a proper unicode value to work with, so you have to use the unicode type (which is what u"" does). Then you need to explicitly encode that value to utf-8 before feeding it to the format parser to build the new string.

As Karl says in his answer, Python 2 is messy and flawed when handling Unicode strings, violating the "Explicit is better than implicit" principle of the Zen of Python. And for even weirder behavior, the following works fine in python2:

 >>> value = "Bitte überprüfen"
 >>> out = '"{}" = "{}";\n'.format('no_internet', value)
 >>> out
 '"no_internet" = "Bitte \xc3\xbcberpr\xc3\xbcfen";\n'
 >>> print(out)
 "no_internet" = "Bitte überprüfen";

Still not convinced about switching to python3? :-)

Update:

This is how to read a Unicode string from one file and write it to another:

 % echo "Bitte überprüfen" > /tmp/foobar
 % python2
 >>> with open('/tmp/foobar', 'r') as f:
 ...     data = f.read().decode('utf-8').strip()
 ...
 >>>
 >>> with open('/tmp/foo2', 'w') as f:
 ...     f.write(('"{}" = "{}";\n'.format('no_internet', data.encode('utf-8'))))
 ...
 >>> import sys;sys.exit(0)
 % cat /tmp/foo2
 "no_internet" = "Bitte überprüfen";
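
For comparison, here is a minimal Python 3 sketch of the same read-then-write round trip. It relies on open() with an explicit encoding, so the file objects do the decoding and encoding. The temporary directory and the file names in it are made up for illustration and are not the /tmp paths used above.

```python
import os
import tempfile

# Hypothetical file names inside a throwaway directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "foobar")
dst_path = os.path.join(workdir, "foo2")

# Prepare the input file as raw UTF-8 bytes, like the echo command would on a UTF-8 terminal.
with open(src_path, "wb") as f:
    f.write("Bitte überprüfen\n".encode("utf-8"))

# Opening with an encoding returns Unicode text, so no manual .decode() is needed.
with open(src_path, "r", encoding="utf-8") as f:
    data = f.read().strip()

# Build the line as Unicode text and let the file object encode it on the way out.
# newline="\n" keeps the output byte-identical on every platform.
line = '"{}" = "{}";\n'.format("no_internet", data)
with open(dst_path, "w", encoding="utf-8", newline="\n") as f:
    f.write(line)

# Read the result back as bytes to see exactly what landed on disk.
with open(dst_path, "rb") as f:
    written = f.read()
print(written.decode("utf-8"), end="")
```

The point of the design is that the text stays Unicode from the read to the write, and the only places an encoding is named are the two open() calls.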

Update:

As a general rule:

when you get a UnicodeDecodeError, you should use .decode('utf-8') on the string containing unicode data, and
when you get a UnicodeEncodeError, you should use .encode('utf-8') on the string containing unicode data.

Update: if you cannot upgrade to python3, you can at least make your python2 behave almost like python3 using the following __future__ import:

 from __future__ import absolute_import, division, print_function, unicode_literals

HTH

Karl Knechtel · Answer 3 · 2015-03-04T10:55:08+0000

Why ascii if I say utf-8?

Because in Python 2, "Bitte überprüfen" is not a Unicode string. Before it can be .encode()d by your explicit call, Python must implicitly decode it to Unicode (which is also why it raises a UnicodeDecodeError), and it picks ASCII because it has no other information to work with. The ü is represented by bytes with values >= 128, so it is not valid ASCII.
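
The implicit step described above can be spelled out explicitly. The following is a rough Python 3 sketch, not Python 2 code: it builds the bytes that a Python 2 str would have held (assuming the source file was saved as UTF-8) and then performs the ASCII decode by hand.

```python
# Python 3 sketch: perform by hand the decode that Python 2 does implicitly.
value = "Bitte überprüfen"

# The bytes a Python 2 byte string would contain, assuming a UTF-8 source file.
line_bytes = ('"%s" = "%s";\n' % ("no_internet", value)).encode("utf-8")

failure = None
try:
    # Python 2 runs the equivalent of this before it can honour .encode("utf-8").
    line_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)

# Decoding with the encoding the bytes were really written in works fine.
explicit = line_bytes.decode("utf-8")
print(explicit, end="")
```

The first non-ASCII byte is the 0xc3 that starts the UTF-8 encoding of ü, at offset 23 of the formatted line, which matches the position reported in the question.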

The u prefix shown by @JuniorCompressor will make it a Unicode string, and you should also specify the source encoding for the file (don't just blindly put utf-8; it must match the encoding your text editor actually saves the .py file with!).

Migrating to Python 3 is realistically (part of) the best long-term solution :), but it's still important to understand the problem. See http://bit.ly/unipain for more details. Python 2's behavior is really a bug, or at least a failure to meet Pythonic design principles: Explicit is better than implicit, and here we see very clearly why ;)

user937284 · Answer 4 · 2015-03-04T14:33:41+0000

As has already been suggested, your error comes from this line:

 f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

it should be:

 f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))

Unicode and Encoding Note

If working with Python 2, software should only work with unicode strings internally, converting to a particular encoding on output.

To avoid making the same mistake over and over, make sure you understand the difference between ascii and utf-8, and also between str and unicode objects in Python.

The difference between ASCII and UTF-8 encoding:

ASCII needs only one byte to represent every character in the ASCII character set. UTF-8 needs up to four bytes per character to represent its complete character set.

 ascii (default)
 1 If the code point is < 128, each byte is the same as the value of the code point.
 2 If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

 utf-8 (unicode transformation format)
 1 If the code point is < 128, it's represented by the corresponding byte value.
 2 If the code point is between 128 and 0x7ff, it's turned into two byte values between 128 and 255.
 3 Code points > 0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
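
These rules are easy to check by hand. The sketch below, written for Python 3, encodes one sample character from each length class; the particular characters are an arbitrary choice for illustration.

```python
# One sample character per UTF-8 length class (arbitrary picks).
samples = {
    "a": 0x61,              # code point < 128: one byte
    "ü": 0xFC,              # between 128 and 0x7ff: two bytes
    "€": 0x20AC,            # above 0x7ff: three bytes
    "\U0001F600": 0x1F600,  # above 0xffff: four bytes
}

utf8_lengths = {}
for char, code_point in samples.items():
    encoded = char.encode("utf-8")
    utf8_lengths[char] = len(encoded)
    print("U+%04X -> %d byte(s): %r" % (code_point, len(encoded), encoded))

# Every byte of a multi-byte sequence is in the range 128..255.
multibyte_ok = all(b >= 128 for b in "ü€".encode("utf-8"))

# ASCII cannot represent anything at or above code point 128.
try:
    "ü".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```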

The difference between str and unicode objects:

You can say that str is basically a byte string, and unicode is a Unicode string. A str holds bytes in some encoding, such as ascii or utf-8; a unicode object holds code points and only gets an encoding when you encode() it.

 str vs. unicode
 1 str = byte string (8-bit) - uses \x and two digits
 2 unicode = unicode string - uses \u and four digits
 3 basestring
        /\
       /  \
    str    unicode

If you follow a few simple rules, you should be fine handling str / unicode objects in different encodings, such as ascii or utf-8, or whatever encoding you have to use:

 Rules
 1 encode(): Gets you from Unicode -> bytes
   encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
 2 decode(): Gets you from bytes -> Unicode
   decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
 3 codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
 4 u"": Makes your string literals into Unicode objects rather than byte sequences.
 5 unicode(string[, encoding, errors])

Warning: don't use encode() on bytes or decode() on Unicode objects.
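
In Python 3 this warning is enforced by the types themselves: bytes has no encode() and str has no decode(). Here is a small sketch of the two legitimate directions, written for Python 3, where str plays the role of Python 2's unicode and bytes the role of Python 2's str.

```python
text = "Bitte überprüfen"            # Unicode text

raw = text.encode("utf-8")           # Unicode -> bytes
round_tripped = raw.decode("utf-8")  # bytes -> Unicode

# The optional errors argument decides what happens to undecodable bytes;
# "replace" substitutes U+FFFD for each one instead of raising.
lossy = raw.decode("ascii", errors="replace")

# The wrong-direction methods simply do not exist in Python 3.
bytes_can_encode = hasattr(raw, "encode")
text_can_decode = hasattr(text, "decode")
print(bytes_can_encode, text_can_decode)
```

In Python 2 both wrong-direction calls exist and trigger the implicit ASCII conversion, which is where the confusing error messages come from.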

And again: software should only work with Unicode strings internally, converting to a particular encoding on output.
