Technically, these are invalid URLs, but they are valid IRIs ( Internationalized Resource Identifiers ) as defined in RFC 3987 .
IRI Encoding Method for URI:
- UTF-8 encodes a path
- % - encode received UTF-8
For example (taken from a related Wikipedia article), this IRI:
https://en.wiktionary.org/wiki/Ῥόδος
... maps this URI:
https://en.wiktionary.org/wiki/%E1%BF%AC%CF%8C%CE%B4%CE%BF%CF%82
I believe that requests processes them out of the box (although only quite recently, and only "partial support" exists before 3.0, and I'm not sure what that means). I'm sure urllib2 in Python2.7 does not do this, and urllib.request in Python 3.6 probably doesn't work either.
Anyway, if your selected HTTP library does not handle IRI, you can do it manually:
def iri_to_uri(iri): p = urllib.parse.urlparse(iri) path = urllib.parse.quote_from_bytes(p.path.encode('utf-8')) p = [:2] + (path,) + p[3:] return urllib.parse.urlunparse(p2)
There are also several third-party libraries for IRI processing, mainly allocated from other projects, such as Twisted and Amara. It might be worth trying PyPI for one, rather than creating one yourself.
Or you may need a higher-level library, such as hyperlink , which handles all complex problems in RFC 3987 (and RFC 3986 , the current version of the specification for URIs that have neither requests 2.x nor Python 3.6 stdlib handle).
If you need to deal with IRI manually, you also have the opportunity to run into IDN internationalized domain names instead of ASCII domain names too, although they are technically not related to specifications. Therefore, you probably want to do something like this:
def iri_to_uri(iri): p = urllib.parse.urlparse(iri) netloc = p.netloc.encode('idna').decode('ascii') path = urllib.parse.quote_from_bytes(p.path.encode('utf-8')) p = [:1] + (netloc, path) + p[3:] return urllib.parse.urlunparse(p2)