Something interesting happens with Unicode and Python requests

First, note that u'\xc3\xa8' is a Python 2 unicode string with two code points, Ã and ¨. Also note that '\xc3\xa8' is a Python 2 byte string holding the UTF-8 encoding of the character è. So although u'\xc3\xa8' and '\xc3\xa8' look very similar, they are two very different animals.
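A minimal sketch of that distinction, written for Python 3 (an assumption on my part; the post itself uses Python 2, where the text literal is u'\xc3\xa8' and the byte literal is '\xc3\xa8'):

```python
# Python 3: str holds code points, bytes holds raw data.
text = '\xc3\xa8'    # two code points: U+00C3 and U+00A8
raw = b'\xc3\xa8'    # two bytes: the UTF-8 encoding of U+00E8

print(len(text), [hex(ord(ch)) for ch in text])
print(raw.decode('utf8') == '\xe8')   # the bytes decode to the single character è
print(text == raw.decode('utf8'))     # the look-alike text is a different string
```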

Now, if we open https://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl in a browser, everything works fine.

If I define in an ipython session:

unicode_url = u'https://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl'

then I can print it and see the same thing that I entered in the browser URL bar, fine. Now let's try this with Python requests.

First, I naively tried just passing in the unicode URL, to see whether requests would simply deal with it: requests.get(unicode_url). Nope, 404. Fine, no problem, URLs should be encoded, so I tried requests.get(unicode_url.encode('utf8')). Again 404. No problem, maybe I need to do the URL encoding too, so I tried requests.get(urllib.quote(unicode_url.encode('utf8'))).... It did not like that at all.
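One thing worth noting about the last attempt: quoting the whole URL also escapes the colon after the scheme, so the result is no longer an absolute URL. Below is a small Python 3 sketch of that (urllib.parse.quote is the Python 3 counterpart of Python 2's urllib.quote). The helper quote_path_only is hypothetical and only illustrates quoting the path component; whether the site accepts the result is not something this sketch checks.

```python
from urllib.parse import quote, urlsplit, urlunsplit

unicode_url = ('https://www.sainsburys.co.uk/shop/gb/groceries/chablis/'
               'chablis-premi\xe8r-cru-brocard-75cl')

# Quoting the entire URL percent-encodes the ":" of "https:" as well.
quoted_whole = quote(unicode_url)


def quote_path_only(url):
    """Hypothetical helper: percent-encode only the path, as UTF-8."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))


quoted_path = quote_path_only(unicode_url)
print(quoted_whole)
print(quoted_path)
```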

However, recalling the similarities between the unicode and byte str objects that I mentioned at the beginning, I also tried:

  requests.get('http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl')

and, to my surprise, it works and gives a successful 200.

What is going on with requests here?

EDIT: as another experiment (in the Scrapy shell this time)

   from scrapy.http import Request
   unicode_url = u'https://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl'
   fetch(Request(unicode_url))

Absolutely no problem! So why do Scrapy and the browser handle this without problems, but not python requests? And why does the alternate URL work in python requests but not in the browser or Scrapy?

Latin1 vs UTF8

Encoding the URL as UTF-8 and then decoding the result as Latin-1 gives the alternate URL:

print unicode_url.encode('utf8').decode('latin1')
u'https://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl'

In Python 2, the code points of a unicode str like u'\xe8' coincide with Latin-1: u'è' == u'\xe8', and u'\xe8'.encode('latin1') == '\xe8' (the byte in the Latin-1 str has the same value as the unicode code point of è).

So the UTF-8 bytes of è, read back as Latin-1, show up as two characters:

In [95]: print u'è'.encode('utf8').decode('latin1')
è

And the other way around:

In [94]: print u'è'.encode('latin1').decode('utf8')
è
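The two round trips above, redone as a Python 3 sketch (Python 3 is my assumption here; the sessions above are Python 2). The variable names are only for illustration.

```python
original = '\xe8'                                     # è
mojibake = original.encode('utf8').decode('latin1')   # UTF-8 bytes misread as Latin-1
restored = mojibake.encode('latin1').decode('utf8')   # undo the misreading

print(repr(mojibake))
print(restored == original)
```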

Requests, for its part, decodes byte-string URLs as UTF-8:

def prepare_url(self, url, params):
    """Prepares the given HTTP URL."""
    #: Accept objects that have string representations.
    #: We're unable to blindly call unicode/str functions
    #: as this will include the bytestring indicator (b'')
    #: on python 3.x.
    #: https://github.com/kennethreitz/requests/pull/2238
    if isinstance(url, bytes):
        url = url.decode('utf8')
    else:
        url = unicode(url) if is_py2 else str(url)

This is from requests/models.py.
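To see what that decode step means for the two encodings, here is a stand-alone sketch that mimics only the first lines of the prepare_url excerpt above. decode_url_like_prepare_url is a hypothetical stand-in, not part of requests, and it covers the Python 3 branch only.

```python
def decode_url_like_prepare_url(url):
    """Hypothetical stand-in for the decode step shown in the excerpt."""
    if isinstance(url, bytes):
        return url.decode('utf8')
    return str(url)


correct = 'chablis-premi\xe8r'

# UTF-8 bytes come back as the original text.
utf8_ok = decode_url_like_prepare_url(correct.encode('utf8')) == correct

# Latin-1 bytes are not valid UTF-8 here, so the decode step raises.
try:
    decode_url_like_prepare_url(correct.encode('latin1'))
    latin1_ok = True
except UnicodeDecodeError:
    latin1_ok = False

print(utf8_ok, latin1_ok)
```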


Consider the following session:

In [1]: import requests

In [2]: s = requests.Session()

In [3]: unicode_url = u'https://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl'

In [4]: s.get(unicode_url)
Out[4]: <Response [404]>

In [5]: s.get(unicode_url)
Out[5]: <Response [200]>

So the very same request works the second time!

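The first request gets a cookie along with the 404 response, and the session sends that cookie back on the second request. With the cookie present, the server answers with a 200.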

More precisely, the second response is not a direct 200: s.get(unicode_url, allow_redirects=False) returns a 302 rather than a 200.

The same can be seen in Chrome: without the cookie, the URL gives a 404. (Clear the cookies and you get the 404 again.)

Now for the alternate URL:

In [11]:   requests.get(u'http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl')
Out[11]: <Response [200]>

This one works without any cookie/session. Look at the Location header of its redirect:

 'Location':  'http://www.sainsburys.co.uk/webapp/wcs/stores/servlet/gb/groceries/chablis/chablis-premi\xc3\xa8r-cru-brocard-75cl?langId=44&storeId=10151&krypto=dZB7Mt97QsHQQ%2BGMpb1iMZwdVfmbg%2BbRUdkh%2FciAItm7%2F4VSUi8NRUiszN3mSofKSCyAv%2F0QRKSsjhHzoo1x7in7Ctd4vzPIDIW5CcjiksLKE48%2BFU9nLNGkVzGj92PknAgP%2FmIFz63xpKhvPkxbJrtUmwi%2FUpbXNW9XIygHyTA%3D&ddkey=http%3Agb%2Fgroceries%2Fchablis%2Fchablis-premi%C3%83%C2%A8r-cru-brocard-75cl'

Note that the path there contains the utf8 encoding of u'è'.

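So my guess (and it is only a guess) is that for any client, whether Chrome, Scrapy or python-requests, the server first treats the URL as latin1. When the URL is actually encoded as utf8, it answers with a 404, and on the following request for that URL it decodes the URL as utf8. Once it is treated as utf8, the URL resolves.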

As for the alternate URL, u'http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl': read as latin1 it gives the right bytes, because u'https://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl'.encode('latin1') is 'https://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premi\xc3\xa8r-cru-brocard-75cl', which is exactly the utf8 encoded byte str of u'https://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl', so it works straight away.
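The byte-level claim can be checked without touching the network. The sketch below is Python 3 (my assumption; the thread uses Python 2) and works on the last path segment only. It also shows where a sequence like the %C3%83%C2%A8 in the ddkey parameter of the Location header above can come from: it is what UTF-8 percent-encoding of the two-character form looks like.

```python
from urllib.parse import quote

correct_path = 'chablis-premi\xe8r-cru-brocard-75cl'           # with è
mojibake_path = correct_path.encode('utf8').decode('latin1')   # with è

# Latin-1 bytes of the mojibake form equal UTF-8 bytes of the correct form.
same_bytes = mojibake_path.encode('latin1') == correct_path.encode('utf8')
print(same_bytes)

# quote() percent-encodes as UTF-8 by default.
print(quote(correct_path))
print(quote(mojibake_path))
```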

How exactly the cookie changes the way the URL is handled is up to the server side, so I can only guess.

See also https://github.com/kennethreitz/requests/blob/eae38b8d131e8b51c3daf3583e69879d1c02f9a4/requests/sessions.py#L101-L114, which deals with this on python3.


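It looks like the server expects the "è" in the URL to be encoded as latin1. Requests on Python 2 does its own "URL encoding", so the "è" goes out as utf-8 and you get a 404.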

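Encoding unicode_url as latin1 before passing it to requests.get does not help either: requests decodes the bytes back to unicode as utf-8, and "è" in latin-1 (the "\xe8" char) is not valid utf-8.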

Note that on Python 3 it simply works:

In [13]: requests.get(unicode_url)
Out[13]: <Response [200]>

So on Python 2.7 requests seems to get in the way here. If you encode unicode_url as latin-1 yourself and use urllib.urlopen on Python 2, it works:

In [28]: a  = urllib.urlopen(unicode_url.encode("latin1"))

In [29]: a.code
Out[29]: 200
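For comparison, here is what the two encodings look like for the path of this URL, as a Python 3 sketch (an assumption; the session above is Python 2 and sends the raw Latin-1 bytes). The percent-encoded forms use urllib.parse.quote with its encoding parameter. Whether the server treats %E8 the same way as the raw \xe8 byte is not verified here.

```python
from urllib.parse import quote

path = '/shop/gb/groceries/chablis/chablis-premi\xe8r-cru-brocard-75cl'

latin1_bytes = path.encode('latin1')             # raw bytes, as in the urlopen call above
latin1_quoted = quote(path, encoding='latin-1')  # è becomes a single escaped byte
utf8_quoted = quote(path)                        # UTF-8 is quote()'s default

print(latin1_bytes)
print(latin1_quoted)
print(utf8_quoted)
```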

(If this is for a script, consider running it on Python 3.6, where this works out of the box.)


Source: https://habr.com/ru/post/1677049/

