Unicode Decoding Problem

Question

This is ridiculous... I am trying to read geo-search data from OpenStreetMap. The code that performs the request is as follows:

    params = urllib.urlencode({'q': ",".join([e for e in full_address]), 'format': "json", "addressdetails": "1"})
    query = "http://nominatim.openstreetmap.org/search?%s" % params
    print query
    time.sleep(5)
    response = json.loads(unicode(urllib.urlopen(query).read(), "UTF-8"), encoding="UTF-8")
    print response

The request for Zürich is correctly URL-encoded as UTF-8. No surprises here.

 http://nominatim.openstreetmap.org/search?q=Z%C3%BCrich%2CSWITZERLAND&addressdetails=1&format=json

When I print the response, the u with umlaut is encoded as latin1 (0xFC):

 [{u'display_name': u'Z\xfcrich, Bezirk Z\xfcrich, Z\xfcrich, Schweiz, Europe', u'place_id': 588094, u'lon': 8.540443

But this is nonsense, because OpenStreetMap returns the JSON data as UTF-8:

    Connecting to nominatim.openstreetmap.org (nominatim.openstreetmap.org)|128.40.168.106|:80... connected.
    HTTP request sent, awaiting response...
      HTTP/1.1 200 OK
      Date: Wed, 26 Jan 2011 13:48:33 GMT
      Server: Apache/2.2.14 (Ubuntu)
      Content-Location: search.php
      Vary: negotiate
      TCN: choice
      X-Powered-By: PHP/5.3.2-1ubuntu4.7
      Access-Control-Allow-Origin: *
      Content-Length: 3342
      Keep-Alive: timeout=15, max=100
      Connection: Keep-Alive
      Content-Type: application/json; charset=UTF-8
    Length: 3342 (3.3K) [application/json]

This is also confirmed by the contents of the file, and I explicitly specify UTF-8 both when reading the data and when parsing the JSON.

What's going on here?

EDIT: apparently it is json.loads that somehow messes this up.

+4

python encoding utf-8 latin1 iso-8859-1

Stefano Borini Jan 26 '11 at 13:50


2 answers

The output is fine. When you print to the console, Python only renders the Unicode characters themselves when you print the string directly. If you print a list of Unicode strings, each string is shown on the console as its repr():

    >>> a=u'á'
    >>> a
    u'\xe1'
    >>> print a
    á
    >>> [a]
    [u'\xe1']
    >>> print [a]
    [u'\xe1']
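The same effect can be sketched in Python 3, with one assumption worth stating loudly: Python 3's repr() no longer escapes non-ASCII characters, so ascii() is used here as a stand-in for what Python 2's repr() shows for a unicode object. The point is only that the escaped form is a display convention; the string itself is unchanged.

```python
# Python 3 sketch; ascii() stands in for Python 2's repr() of a unicode object.
name = u"Z\xfcrich"      # the same text as in the response

escaped = ascii(name)    # escaped display form of the string
in_list = ascii([name])  # inside a container, items are shown in escaped form

print(escaped)
print(in_list)
# The data itself is untouched: six characters, the second one is U+00FC.
print(len(name), hex(ord(name[1])))
```

The \xfc in the display is one character with code point 0xFC, not a latin1 byte.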

+1

vz0 Jan 26 '11 at 14:05


etarion · Accepted Answer · 2011-01-26T13:54:43+0000

"When I go and print the response, the u with umlaut is encoded as latin1 (0xFC)"

You are simply misinterpreting the output. This is a unicode string (you can tell by the u prefix), and there is no encoding "attached" to it. \xFC means the code point with the number 0xFC, which happens to be u with umlaut (see http://www.fileformat.info/info/unicode/char/fc/index.htm). It looks like latin1 only because the first 256 Unicode code points are numbered identically to the latin1 (ISO-8859-1) encoding.
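A quick way to convince yourself of that coincidence, as a Python 3 sketch (the variable names are just for illustration): every code point below 256 encodes to the identical single byte in latin1, while UTF-8 uses two bytes for U+00FC.

```python
# Python 3 sketch: code points 0..255 map to the same single byte in latin-1.
for code_point in range(256):
    assert chr(code_point).encode("latin-1") == bytes([code_point])

u_umlaut = u"\xfc"                      # U+00FC, the character in question
as_latin1 = u_umlaut.encode("latin-1")  # one byte: 0xFC
as_utf8 = u_umlaut.encode("utf-8")      # two bytes: 0xC3 0xBC (the %C3%BC in the URL)

print(ascii(u_umlaut), as_latin1, as_utf8)
```

So seeing \xfc in a unicode repr says nothing about latin1 bytes; it is just the code point written in hex.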

In short, you did everything right: you have a unicode object with the desired content, which is independent of any encoding. You choose the encoding only when you output that content somewhere, by calling unicodestr.encode("utf-8") or by using codecs; see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
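As a minimal sketch of that decode-on-input, encode-on-output pattern in Python 3: the response body below is a hypothetical, trimmed-down stand-in for the real payload, and the variable names are mine.

```python
import json

# Hypothetical, trimmed-down response body as UTF-8 bytes (not the real payload).
raw_body = b'[{"display_name": "Z\xc3\xbcrich, Schweiz"}]'

# Decode once at the input boundary; after this the text carries no encoding.
results = json.loads(raw_body.decode("utf-8"))
display_name = results[0]["display_name"]

# Encode only at the output boundary, picking whatever the destination needs.
utf8_out = display_name.encode("utf-8")
latin1_out = display_name.encode("latin-1")

print(ascii(display_name))
print(utf8_out)
print(latin1_out)
```

The same text yields different bytes depending on the encoding chosen at output time, which is why the intermediate unicode object cannot be said to "be" latin1 or UTF-8.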
