How to navigate URLs using \ u in them?

I came across URLs that have Unicode characters inside them, like the following (note that this will not appear on a valid page - this is just an example).

http://my_site_name.com/\u0442\uab86\u0454\uab8eR-\u0454\u043d-\u043c/23795908

How can I decode / encode such a URL using Python so that I can successfully execute an HTTP GET to retrieve data from this webpage?

+5
source share
2 answers

Technically, these are invalid URLs, but they are valid IRIs ( Internationalized Resource Identifiers ) as defined in RFC 3987 .

IRI Encoding Method for URI:

  • UTF-8 encodes a path
  • % - encode received UTF-8

For example (taken from a related Wikipedia article), this IRI:

 https://en.wiktionary.org/wiki/Ῥόδος 

... maps this URI:

 https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82 

I believe that requests processes them out of the box (although only quite recently, and only "partial support" exists before 3.0, and I'm not sure what that means). I'm sure urllib2 in Python2.7 does not do this, and urllib.request in Python 3.6 probably doesn't work either.

Anyway, if your selected HTTP library does not handle IRI, you can do it manually:

 def iri_to_uri(iri): p = urllib.parse.urlparse(iri) path = urllib.parse.quote_from_bytes(p.path.encode('utf-8')) p = [:2] + (path,) + p[3:] return urllib.parse.urlunparse(p2) 

There are also several third-party libraries for IRI processing, mainly allocated from other projects, such as Twisted and Amara. It might be worth trying PyPI for one, rather than creating one yourself.

Or you may need a higher-level library, such as hyperlink , which handles all complex problems in RFC 3987 (and RFC 3986 , the current version of the specification for URIs that have neither requests 2.x nor Python 3.6 stdlib handle).


If you need to deal with IRI manually, you also have the opportunity to run into IDN internationalized domain names instead of ASCII domain names too, although they are technically not related to specifications. Therefore, you probably want to do something like this:

 def iri_to_uri(iri): p = urllib.parse.urlparse(iri) netloc = p.netloc.encode('idna').decode('ascii') path = urllib.parse.quote_from_bytes(p.path.encode('utf-8')) p = [:1] + (netloc, path) + p[3:] return urllib.parse.urlunparse(p2) 
+4
source

Here is an approach that automates the detection and% coding of non-ASCII as in the IRI path and domain sections:

 from urllib.request import quote def iri_to_uri(iri): return ("".join([x if ord(x) < 128 else quote(x) for x in iri])) 
+1
source

Source: https://habr.com/ru/post/1276220/


All Articles