How to send non-ASCII characters using httplib when the content type is "application/xml"

I implemented a Pivotal Tracker API module in Python 2.7. The Pivotal Tracker API expects the POST data to be an XML document, with "application/xml" as the content type.

My code uses urllib2/httplib to POST the document as shown:

    request = urllib2.Request(self.url,
                              xml_request.toxml('utf-8') if xml_request else None,
                              self.headers)
    obj = parse_xml(self.opener.open(request))

This throws an exception if the XML text contains non-ASCII characters:

      File "/usr/lib/python2.7/httplib.py", line 951, in endheaders
        self._send_output(message_body)
      File "/usr/lib/python2.7/httplib.py", line 809, in _send_output
        msg += message_body
    exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 89: ordinal not in range(128)

As you can see, httplib._send_output tries to treat the message payload as an ASCII string, presumably because it expects the data to be URL-encoded (application/x-www-form-urlencoded). It works fine with application/xml as long as only ASCII characters are used.

Is there an easy way to POST application/xml data containing non-ASCII characters, or will I have to jump through hoops (e.g. using Twisted and a custom producer for the POST payload)?

+6
4 answers

You are mixing Unicode strings and byte strings.

    >>> msg = u'abc'            # Unicode string
    >>> message_body = b'\xc5'  # bytestring
    >>> msg += message_body
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)

To fix this, make sure that the contents of self.headers are properly encoded, i.e. all keys and values in the headers should be byte strings:

    self.headers = dict((k.encode('ascii') if isinstance(k, unicode) else k,
                         v.encode('ascii') if isinstance(v, unicode) else v)
                        for k, v in self.headers.items())

Note: the character encoding of the headers has nothing to do with the character encoding of the body, that is, the XML text can be encoded independently (it is just an octet stream from the point of view of the HTTP message).

The same applies to self.url if it is of unicode type; convert it to a byte string (using the encoding "ascii").


An HTTP message consists of a start line, the headers, an empty line, and possibly a message body. self.headers is used for the headers. self.url is used for the start line (the HTTP method goes there too) and probably for the Host header (if the client speaks HTTP/1.1). The XML text goes into the message body (as a binary blob).
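To make that layout concrete, here is a minimal sketch that assembles the bytes of a request by hand. The function name build_http_message and the example host and path are made up for illustration; this is not how httplib does it internally, only a picture of which parts are ASCII text and which part is an opaque byte blob.

```python
def build_http_message(method, path, host, headers, body):
    """Assemble a raw HTTP/1.1 request (hypothetical helper, illustration only).

    method, path, host and the header names/values are ASCII text;
    body is an already-encoded byte string and is appended untouched.
    """
    lines = ['%s %s HTTP/1.1' % (method, path), 'Host: %s' % host]
    lines.extend('%s: %s' % (name, value) for name, value in headers)
    # Start line and headers: ASCII only, terminated by an empty line.
    head = '\r\n'.join(lines).encode('ascii') + b'\r\n\r\n'
    # Body: just octets, whatever encoding the XML document uses.
    return head + body


body = u'<name>\u0141</name>'.encode('utf-8')  # non-ASCII payload as bytes
message = build_http_message(
    'POST', '/stories', 'example.com',
    [('Content-Type', 'application/xml'), ('Content-Length', str(len(body)))],
    body)
```

Because the head is encoded to bytes before the body is appended, no implicit ASCII decoding of the payload can happen.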

It is always safe to use the ASCII encoding for self.url (IDNA can be used for non-ASCII domain names, and its result is also ASCII).

Here is what RFC 7230 says about character encoding in header fields:
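As a rough sketch of that idea, the helper below turns a Unicode URL into an ASCII-only one: the host name goes through Python's built-in 'idna' codec and the path is percent-encoded as UTF-8. The name ascii_url is hypothetical, and the sketch assumes the URL has no user:password part and that the query string is already ASCII.

```python
try:
    from urllib.parse import urlsplit, urlunsplit, quote  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2
    from urllib import quote


def ascii_url(url):
    """Return an ASCII-only form of a Unicode URL (hypothetical helper)."""
    parts = urlsplit(url)
    # IDNA maps a non-ASCII host name to its ASCII "xn--" form.
    host = parts.hostname.encode('idna').decode('ascii')
    if parts.port:
        host += ':%d' % parts.port
    # Percent-encode the UTF-8 bytes of the path; keep '/' and existing escapes.
    path = quote(parts.path.encode('utf-8'), safe='/%')
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(ascii_url(u'http://b\xfccher.example/p\xe4th'))
# http://xn--bcher-kva.example/p%C3%A4th
```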

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

To convert XML to a byte string, see the application/xml encoding considerations:

The use of UTF-8, without a BOM, is RECOMMENDED for all XML MIME entities.
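A minimal sketch of producing such a body with xml.dom.minidom, which is what the toxml('utf-8') call in the question suggests is being used. The element names story and name and the helper build_story_xml are invented for the example; the explicit charset parameter in the Content-Type header is an optional extra, not something the question requires.

```python
from xml.dom.minidom import Document


def build_story_xml(name):
    """Serialize a tiny XML document to UTF-8 bytes (hypothetical helper)."""
    doc = Document()
    story = doc.createElement('story')        # element names are made up
    name_node = doc.createElement('name')
    name_node.appendChild(doc.createTextNode(name))
    story.appendChild(name_node)
    doc.appendChild(story)
    # Passing an encoding makes toxml() return a byte string, not unicode.
    return doc.toxml(encoding='utf-8')


body = build_story_xml(u'Caf\xe9')
headers = {
    'Content-Type': 'application/xml; charset=utf-8',
    'Content-Length': str(len(body)),  # length in bytes, not characters
}
```

The body is already a byte string at this point, so it can be passed as the request data without any further encoding step.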

+7

Check whether self.url is unicode. If it is, httplib will treat the data as unicode.

You can force self.url to be encoded as unicode; httplib will then process all the data as unicode.

+2

This is the same as J.F. Sebastian's answer, but I'm adding a new one so that the code formatting works (and so that it is easier to find via Google).

Here is what to do if you run into this at the end of a mechanize form request:

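    import mechanize

    br = mechanize.Browser()
    # (a page containing the form is assumed to have been opened already)
    br.select_form(nr=0)
    br['form_thingy'] = u"Wonderful"
    # Make sure every header name and value is a byte string.
    headers = dict((k.encode('ascii') if isinstance(k, unicode) else k,
                    v.encode('ascii') if isinstance(v, unicode) else v)
                   for k, v in br.request.headers.items())
    br.addheaders = list(headers.items())  # addheaders takes (name, value) pairs
    req = br.submit()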
+1

There are three things going on here.

  • Byte string + Unicode string: the byte string is implicitly decoded as ASCII and the result is a Unicode string.
  • Python 2.7 httplib simply uses + to join the headers and the body, which I think is not good practice, since we should not rely on automatic type conversion. Python 2.6 httplib behaves differently.
  • The HTTP standard suggests the ISO-8859-1 encoding for headers; if you want to include characters outside ISO-8859-1, you must encode them as described in RFC 2047.

A simple solution is to explicitly encode both the headers and the body to UTF-8 before sending.

0

Source: https://habr.com/ru/post/900683/