Broken unicode strings encoded in UTF-8?

I have been studying unicode and its implementation in Python for two days, and I think I am starting to understand what it is about. To make sure, I am asking whether my assumptions about my current problem are correct.

In Django, forms give me unicode strings, which I suspect are broken. Unicode strings in Python must be encoded in UTF-8, right? After entering the string "fähre" in the text field, the browser sends the string "f%c3%a4hre" in the POST request (checked with Wireshark). When I get the value through form.cleaned_data, I get the string u'f\xe4hre' (note that this is a unicode string). As far as I understand, this is an ISO-8859-1 encoded string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be UTF-8 encoded unicode. Is this a Django bug or is there something wrong with my understanding? To fix the problem, I wrote a function that I apply to any text input from Django forms:

def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')

which gives

>>> fix_broken_unicode(u'f\xe4hre')
u'f\xc3\xa4hre'

Note that Django's DEFAULT_CHARSET is set to "utf-8".

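Edit: Thanks to Dirk and sth, this is clear now. For context, I am working with the Twitter API, where GET and POST parameters have to be UTF-8 encoded before they are passed to urllib.urlencode().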
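The GET/POST point can be sketched roughly as follows. This is only an illustration in Python 3, which is an assumption: the question uses Python 2, where the function is urllib.urlencode(), while in Python 3 it is urllib.parse.urlencode() and str takes the role of Python 2's unicode. The parameter name q is made up.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical parameter name "q"; the value is a text (unicode) string.
params = {"q": "f\u00e4hre"}  # "fähre"

# In Python 3, urlencode() encodes text as UTF-8 by default
# before percent-escaping the resulting bytes.
query = urlencode(params)
print(query)  # q=f%C3%A4hre

# Parsing the query string (UTF-8 is the default) restores the text.
decoded = parse_qs(query)["q"][0]
print(decoded == "f\u00e4hre")  # True
```

The percent-escaped bytes c3 a4 are the same ones the browser sent in the POST request above.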

+3
2

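u'f\xe4hre' is a unicode string, not a UTF-8 or otherwise encoded string. It contains the unicode character 0xe4, which is ä. It just so happens that ä is also encoded as the byte 0xe4 in ISO-8859-1.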

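A unicode string can hold any character, not only those of one particular encoding. For example, 轮渡 is u'\u8f6e\u6e21', two Unicode code points. Encoded as UTF-8 it would be '\xe8\xbd\xae\xe6\xb8\xa1'.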

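The 轮渡 example can be checked with a few lines. This is a sketch in Python 3 (an assumption, since the thread uses Python 2): Python 3's str corresponds to Python 2's unicode, and bytes to Python 2's str.

```python
# 轮渡 as a text string: two code points, no encoding involved yet.
text = "\u8f6e\u6e21"

print(len(text))                    # 2
print([hex(ord(c)) for c in text])  # ['0x8f6e', '0x6e21']

# Encoding turns the two code points into six UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)       # b'\xe8\xbd\xae\xe6\xb8\xa1'
print(len(utf8_bytes))  # 6
```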

+4

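A unicode string is a sequence of code points, and these can be greater than 255. Depending on the build, Python stores each of them internally in 16 bits or more. ISO-8859-1 covers only the first 256 of them. So in u'f\xe4hre' the \xe4 is just the way Python writes that code point when it displays the string; it is (here) not a byte.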
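A small sketch of the difference between a code point and a byte, again in Python 3 as an assumption (str there plays the role of Python 2's unicode):

```python
word = "f\u00e4hre"  # "fähre"

# The code point of ä is the number 0xe4; it is not a byte.
code_point = ord(word[1])
print(hex(code_point))  # 0xe4

# ISO-8859-1 maps each of the first 256 code points to the byte
# with the same value, which is why the two are easy to confuse.
latin1_bytes = word.encode("iso-8859-1")
print(latin1_bytes)  # b'f\xe4hre'

# Code points above 255 have no ISO-8859-1 byte at all.
try:
    "\u8f6e".encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False
```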

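UTF-8 is an encoding, that is, a way to represent a unicode string as a sequence of 8-bit bytes. To get the encoded bytes for a unicode string, call its encode method with the name of the encoding (for example UTF-8).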

To go the other way, from encoded bytes back to unicode, use decode.
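The encode/decode round trip, and what happens when the bytes are decoded with the wrong encoding, might look like this. It is a sketch in Python 3 (an assumption; in Python 2 the same methods exist on unicode and str), and the name mojibake is only illustrative.

```python
word = "f\u00e4hre"  # "fähre"

# encode: text -> bytes
encoded = word.encode("utf-8")
print(encoded)  # b'f\xc3\xa4hre'

# decode with the same encoding: bytes -> the original text
decoded = encoded.decode("utf-8")
print(decoded == word)  # True

# Decoding the UTF-8 bytes as ISO-8859-1 instead (which is what
# fix_broken_unicode in the question does) yields six characters.
mojibake = encoded.decode("iso-8859-1")
print(ascii(mojibake))           # 'f\xc3\xa4hre'
print(len(word), len(mojibake))  # 5 6
```

The last result is a six-character text string in which ä has turned into the two characters Ã and ¤, so the "fix" damages a string that was correct to begin with.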

+1

Source: https://habr.com/ru/post/1736270/

