Python URL Encoding / Decoding

Question

I am trying to encode, store, and decode arguments in Python, and I am getting lost somewhere along the way. Here are my steps:

1) I am using the google toolkit gtm_stringByEscapingForURLArgument to properly convert an NSString to pass HTTP arguments.

2) On my server (Python), I store these string arguments as something like u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on the iPhone keyboard in the "123" and "#+=" views; the \u and \x escapes in there are currency symbols such as pound, yen, etc.)

3) I call urllib.quote(myString,'') on this stored value, presumably to %-escape the characters for transport to the client so that the client can unescape them.

As a result, I get an exception when I try to write out the result of the %-escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value, with its \u and \x escapes, in order to properly convert it for sending over HTTP?

Update: The suggestion marked as the answer below worked for me. However, I am providing some updates to address the comments below.

The exception I received cited a problem with \u20ac. I do not know whether the problem was with that character specifically, or with the fact that it was the first Unicode character in the string.

The \u20ac character is the Unicode code point for the euro sign. Basically, I found that I would have problems with it unless I used the urllib2 quote method.

+45

python url-encoding

Joey Aug 25 '10

3 answers

I want to second pycruft's remark. Web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. URLs happen to be explicitly defined not for characters, but only for bytes (octets). As a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect, that a particular encoding is present. However, there is a convention to prefer Latin-1 and UTF-8 over other encodings. For a while it looked like 'unicode percent escapes' would be the future, but they never caught on.

In this area it is extremely important to be pedantically picky about the difference between unicode objects and octet strs (in Python < 3.0; confusingly, these correspond to str objects holding Unicode text and bytes / bytearray objects in Python >= 3.0). Unfortunately, in my experience, for a number of reasons it is quite difficult to cleanly separate the two concepts in Python 2.x.

Even more OT: when you want to receive third-party HTTP requests, you cannot absolutely rely on URLs being sent as percent-escaped, UTF-8-encoded octets. There may be the occasional %uxxxx escape in there, and at least Firefox 2.x used to encode URLs as Latin-1 where possible, and as UTF-8 only where necessary.
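As a rough illustration of that last point, here is a minimal sketch of a tolerant decoder that tries UTF-8 first and falls back to Latin-1. It assumes Python 3 (the thread itself targets Python 2), and decode_url_component is a hypothetical helper name, not a library function. It does not attempt to handle %uxxxx escapes.

```python
from urllib.parse import unquote_to_bytes


def decode_url_component(escaped):
    """Percent-decode `escaped`, then guess the text encoding (hypothetical helper)."""
    # Undo the percent-escaping first; the result is raw octets, not text.
    octets = unquote_to_bytes(escaped)
    try:
        # Prefer UTF-8, which is the more common convention.
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a character, so this cannot fail.
        return octets.decode("latin-1")


print(decode_url_component("%E2%82%AC"))  # octets that are valid UTF-8
print(decode_url_component("%A3"))        # a lone 0xA3 byte is not valid UTF-8
```

The fallback is only a guess: some byte sequences are valid in both encodings, so a Latin-1 URL can occasionally be misread as UTF-8.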

+4

flow Aug 25 '10 at 14:40

You are out of luck with the stdlib: urllib.quote does not work with unicode. If you use Django, you can use django.utils.http.urlquote, which works correctly with unicode.
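The limitation above applies to Python 2. For readers on Python 3 (an assumption, since the question uses Python 2), a small sketch of the standard-library equivalent: urllib.parse.quote accepts text directly and encodes it as UTF-8 unless another encoding is passed.

```python
from urllib.parse import quote

euro = "\u20ac"

# A str argument is encoded as UTF-8 by default before being percent-escaped.
utf8_escaped = quote(euro, safe="")

# A different byte encoding can be requested explicitly.
pound_latin1 = quote("\xa3", safe="", encoding="latin-1")

print(utf8_escaped)   # %E2%82%AC
print(pound_latin1)   # %A3
```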

+2

almir karic Aug 25 '10 at 6:33

pycruft · Accepted Answer · 2010-08-25 11:48

URL-encoding a "raw" unicode string does not really make sense. First you need to .encode("utf8") so that you have a known byte encoding, and then quote() the result.

The output is not very pretty, but it should be a correct URI encoding.

    >>> import urllib2
    >>> s = u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
    >>> urllib2.quote(s.encode("utf8"))
    '1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'

Remember that you will need both unquote() and decode() to print it correctly, for example when you are debugging.

    >>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
    1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>â‚¬Â£Â¥â€¢.,?!'
    >>> # oops, nasty Â means we've got a utf8 byte stream being treated as an ascii stream
    >>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
    1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
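The same round trip can be sketched for Python 3, where the functions live in urllib.parse rather than urllib2 (an assumption about the reader's environment, since the session above is Python 2). Keeping the bytes step explicit shows why decoding with the wrong codec produces the garbled output seen above.

```python
from urllib.parse import quote, unquote_to_bytes

s = '1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\''

# Encode to a known byte encoding first, then percent-escape the bytes.
escaped = quote(s.encode("utf-8"), safe="")

# Reverse the steps in the opposite order: unescape to bytes, then decode.
octets = unquote_to_bytes(escaped)
restored = octets.decode("utf-8")

# Decoding the same bytes with a single-byte codec yields mojibake instead.
garbled = octets.decode("latin-1")

print(restored == s)
print(garbled == s)
```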

This is essentially what the django functions mentioned in another answer do.

The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python's standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)

Be careful not to mangle things if you apply any further quoting or encoding.
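One common way things get mangled is quoting a value twice. A short sketch, again assuming Python 3's urllib.parse, shows that the second pass escapes the percent signs themselves, so a single unquote on the receiving side no longer recovers the original text.

```python
from urllib.parse import quote, unquote

once = quote("\u20ac", safe="")
twice = quote(once, safe="")  # the '%' characters are escaped as %25

print(once)   # %E2%82%AC
print(twice)  # %25E2%2582%25AC

# One unquote only undoes the outer layer.
print(unquote(twice) == once)
# A second unquote is needed to get back to the euro sign.
print(unquote(unquote(twice)) == "\u20ac")
```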
