Python UTF-8 XML Parsing (SUDS): Removing an "invalid token"

Here is a common pitfall when working with UTF-8: the "invalid token" error.

In my case it comes from a SOAP service provider that does not respect Unicode: it simply trims values to 100 bytes, ignoring the fact that the 100th byte may fall in the middle of a multibyte character. For example:

<name xsi:type="xsd:string">浙江家庭教会五十人遭驱散及抓打 圣诞节聚会被断电及抢走物品(图、视频\xef\xbc</name> 

The last two bytes are what is left of a 3-byte Unicode character after the truncation knife assumed the world uses 1-byte characters. Next stop, the SAX parser:

 xml.sax._exceptions.SAXParseException: <unknown>:1:2392: not well-formed (invalid token) 
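For illustration, a minimal Python 2 sketch of how such an orphaned byte pair arises (the string is made up; only the byte arithmetic matters):

 # -*- coding: utf-8 -*-
 # Each CJK character below is 3 bytes in UTF-8, so a blind byte-count cut
 # can land in the middle of a character.
 full = u"视频）".encode('utf-8')   # 9 bytes: 3 characters x 3 bytes each
 truncated = full[:8]               # cut one byte short of the last character
 print(repr(truncated))             # ends in '\xef\xbc' -- the same orphaned pair as above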

I don't need this character. I just want it removed from the document so the SAX parser can do its job.

The XML response is otherwise valid; only these values are affected.

Question: how can I remove this character without parsing the entire document and re-encoding it to UTF-8, checking every byte?

Usage: Python + SUDS

2 answers

It turns out that SUDS sees the XML as type 'str' (not unicode), so the values are still UTF-8 encoded bytes.

1) FILTER:

 badXML = "your bad utf-8 xml here" #(type <str>) #Turn it into a python unicode string - ignore errors, kick out bad unicode decoded = badXML.decode('utf-8', errors='ignore') #(type <unicode>) #turn it back into a string, using utf-8 encoding. goodXML = decoded.encode('utf-8') #(type <str>) 

2) SUDS: see https://fedorahosted.org/suds/wiki/Documentation#MessagePlugin

 from suds.plugin import MessagePlugin

 class UnicodeFilter(MessagePlugin):
     def received(self, context):
         # context.reply is the raw UTF-8 byte string; drop any invalid sequences
         decoded = context.reply.decode('utf-8', 'ignore')
         reencoded = decoded.encode('utf-8')
         context.reply = reencoded

and

 from suds.client import Client

 client = Client(WSDL_url, plugins=[UnicodeFilter()])

Hope this helps someone.


Note: thanks to John Machin!

See: Why does the Python decoder replace more than the invalid bytes from an encoded string?

Python issue 8271, regarding errors='ignore', may be at play here. Without that fix in Python, 'ignore' will consume the next few bytes in order to satisfy the length declared by the start byte:

during decoding of an invalid UTF-8 byte sequence, only the start byte and the continuation byte(s) are now considered invalid, instead of the number of bytes specified by the start byte

The problem has been fixed in:
Python 2.6.6 rc1
Python 2.7.1 rc1 (and all later 2.7 releases)
Python 3.1.3 rc1 (and all later 3.x releases)

Python 2.5 and earlier still contain this bug.

In the above example, "\xef\xbc</name".decode('utf-8', 'ignore') should return "</name", but in the "bugged" versions of Python it returns "/name".

The first four bits (0xe) announce a three-byte UTF-8 character, so the decoder consumes bytes 0xef, 0xbc, and then (erroneously) 0x3c ('<').

0x3c is not a valid continuation byte, which is what made the three-byte UTF-8 sequence invalid in the first place.

Fixed Python versions discard only the start byte and any valid continuation bytes, leaving 0x3c unconsumed.
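To see the difference on your own interpreter, a Python 2 sketch using the byte string quoted above:

 broken_tail = '\xef\xbc</name'
 print(repr(broken_tail.decode('utf-8', 'ignore')))
 # fixed Pythons (2.6.6+, 2.7.1+, 3.1.3+): u'</name'
 # bugged Pythons (e.g. 2.6.5):            u'/name' -- the '<' is swallowed as well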


@FlipMcF's answer is correct. I'm just posting the filter from my solution, because the original didn't work for me (I had some emoji characters in my XML that were correctly encoded as UTF-8, but they still broke the XML parser):

 from suds.plugin import MessagePlugin

 class UnicodeFilter(MessagePlugin):
     def received(self, context):
         from lxml import etree
         from StringIO import StringIO

         parser = etree.XMLParser(recover=True)  # recover=True is important here
         doc = etree.parse(StringIO(context.reply), parser)
         context.reply = etree.tostring(doc)
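For completeness, this plugin is registered with the client the same way as the filter above (a sketch, assuming the same WSDL_url):

 from suds.client import Client

 client = Client(WSDL_url, plugins=[UnicodeFilter()])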

Source: https://habr.com/ru/post/905022/

