I have been studying Unicode and its implementation in Python for two days, and I think I now have a rough idea of what it is about. To make sure, I would like to ask whether my assumptions about my current problem are correct.
In Django, forms give me unicode strings which I suspect are broken. Unicode strings in Python should be encoded in UTF-8, right? After entering the string "fähre" in a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked with Wireshark). When I get the value through form.cleaned_data, I get the string u'f\xa4hre' (note that this is a unicode string). As far as I understand, this is an ISO-8859-1 encoded string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8 encoded unicode string. Is this a Django bug, or is there something wrong with my understanding? To fix the problem, I wrote a function that I apply to every text input from Django forms:
def fix_broken_unicode(s):
    # Python 2: encode to UTF-8 bytes, then read each byte back as one ISO-8859-1 character
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which gives:
>>> fix_broken_unicode(u'f\xa4hre')
u'f\xc3\xa4hre'
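For comparison, here is a minimal sketch of the round trip the form value goes through, using only the standard library. It is written for Python 3, which is an assumption on my part (the code above is Python 2, where `unicode` plays the role of Python 3's `str`), and the variable names are only illustrative. It is not Django's actual request handling, just the same two steps done by hand: undo the percent-encoding, then decode the bytes as UTF-8.

```python
from urllib.parse import unquote_to_bytes

# Percent-encoded value as it appears in the POST body.
wire_value = "f%c3%a4hre"

# Step 1: undo the percent-encoding to get the raw bytes the browser sent.
raw_bytes = unquote_to_bytes(wire_value)

# Step 2: decode those bytes as UTF-8 to get a text string of code points.
text = raw_bytes.decode("utf-8")

# Encoding the text again shows how the same string looks as bytes
# in two different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

print(raw_bytes)      # b'f\xc3\xa4hre'
print(ascii(text))    # 'f\xe4hre'
print(len(text))      # 5
print(utf8_bytes)     # b'f\xc3\xa4hre'
print(latin1_bytes)   # b'f\xe4hre'
```

In this sketch the decoded string has five characters, and the ä is the single code point U+00E4. The two bytes C3 A4 only appear once the string is encoded as UTF-8 bytes.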
Django.DEFAULT_CHARSET is "utf-8". [...] unicode [...] u'...' [...]
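Edit: Thanks to Dirk and sth [...]. The data is used for API calls to Twitter etc. with GET or POST parameters, which have to be UTF-8 encoded for urllib.urlencode() ([...]). [...] pastebin [...].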
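As a rough illustration of UTF-8 encoded GET or POST parameters, here is a minimal sketch in Python 3, which is an assumption (in Python 2, `urllib.urlencode()` calls `str()` on each value, so a unicode value with non-ASCII characters has to be encoded with `.encode('utf-8')` first). The parameter name `status` is hypothetical, and this is not the pastebin solution mentioned above.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical parameter name; the value is the text string "fähre".
params = {"status": "f\u00e4hre"}

# In Python 3, urlencode percent-encodes the UTF-8 bytes of each text value
# (its encoding argument defaults to UTF-8).
query = urlencode(params)

# Encoding to UTF-8 bytes explicitly first, as Python 2 requires,
# produces the same query string.
query_from_bytes = urlencode({"status": "f\u00e4hre".encode("utf-8")})

# Parsing the query string decodes the percent-encoded UTF-8 back to text.
decoded = parse_qs(query)

print(query)             # status=f%C3%A4hre
print(query_from_bytes)  # status=f%C3%A4hre
```

Either way the server receives the same percent-encoded UTF-8 bytes that the browser sent for the form field.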