It turns out that SUDS sees the XML as type 'str' (not 'unicode'), so the values it handles are already encoded bytes.
1) FILTER:
badXML = "your bad utf-8 xml here"
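As a minimal sketch of the filter step (it is the same decode-then-re-encode round trip the plugin below performs), something like the following could be used. The helper name `strip_invalid_utf8` and the sample bytes are hypothetical, chosen only for illustration.

```python
# Hypothetical sample: valid UTF-8 text followed by one stray invalid byte (0xff).
bad_xml = b"<name>caf\xc3\xa9\xff</name>"


def strip_invalid_utf8(raw):
    """Drop byte sequences that are not valid UTF-8 and return UTF-8 bytes."""
    # 'ignore' is passed positionally because keyword arguments to decode()
    # were only added in Python 2.7.
    decoded = raw.decode('utf-8', 'ignore')   # unicode text, bad bytes dropped
    return decoded.encode('utf-8')            # back to an encoded byte string


good_xml = strip_invalid_utf8(bad_xml)
```

Already-valid input passes through unchanged, so the filter should be safe to apply to every reply.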
2) SUDS: see https://fedorahosted.org/suds/wiki/Documentation#MessagePlugin
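from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):
    def received(self, context):
        # context.reply is the raw (encoded) reply; drop any invalid UTF-8 bytes
        decoded = context.reply.decode('utf-8', errors='ignore')
        # hand SUDS back an encoded byte string, as it expects
        reencoded = decoded.encode('utf-8')
        context.reply = reencoded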
and
from suds.client import Client

client = Client(WSDL_url, plugins=[UnicodeFilter()])
Hope this helps someone.
Note: thanks to John Machin!
See: Why does a python decoder replace more than invalid bytes from an encoded string?
Python issue8271, regarding errors='ignore', may be at play here. Without this bug fixed in Python, "ignore" will consume the next few bytes in order to satisfy the length announced by the start byte. From the fix description:
during decoding of an invalid UTF-8 byte sequence, only the start byte and continuation byte are considered invalid, instead of the number of bytes specified by the start byte
The problem has been fixed in:
Python 2.6.6 rc1
Python 2.7.1 rc1 (and all future versions 2.7)
Python 3.1.3 rc1 (and all future versions 3.x)
Python 2.5 and earlier still contain this problem.
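Rather than comparing version numbers, one option is to probe the interpreter's behaviour directly. The sketch below is only an illustration: the function name `utf8_ignore_overconsumes` is made up, and the sample bytes are the ones from the linked question.

```python
def utf8_ignore_overconsumes():
    """Return True if decode(..., 'ignore') swallows a valid byte after a
    truncated multi-byte sequence (the behaviour described in issue8271)."""
    sample = b"\xef\xbc</name"
    cleaned = sample.decode('utf-8', 'ignore').encode('utf-8')
    # A corrected decoder drops only 0xef 0xbc and keeps the '<'.
    return cleaned != b"</name"


if utf8_ignore_overconsumes():
    print("This interpreter drops valid bytes when ignoring bad UTF-8.")
```

On a fixed interpreter the function returns False and nothing is printed.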
In the above example, "\xef\xbc</name".decode('utf-8', errors='ignore') should return "</name", but in the "bugged" versions of Python it returns "/name".
The first four bits (0xe) of the start byte announce a three-byte UTF-8 character, so the bytes 0xef, 0xbc, and then (erroneously) 0x3c ('<') are consumed.
0x3c is not a valid continuation byte, which is what makes this three-byte UTF-8 character invalid in the first place.
Corrected Python versions remove only the start byte and any valid continuation bytes, leaving 0x3c unconsumed.
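The byte-level reasoning can be sketched in a few lines. This is a simplified illustration, not the decoder's actual implementation: the helper names are hypothetical, and overlong or out-of-range start bytes are deliberately not handled.

```python
def expected_length(start_byte):
    """Sequence length announced by a UTF-8 start byte (0 if not a start byte)."""
    if start_byte < 0x80:
        return 1                      # 0xxxxxxx: plain ASCII
    if start_byte >> 5 == 0b110:
        return 2                      # 110xxxxx
    if start_byte >> 4 == 0b1110:
        return 3                      # 1110xxxx (0xef falls here)
    if start_byte >> 3 == 0b11110:
        return 4                      # 11110xxx
    return 0                          # 10xxxxxx is a continuation byte


def is_continuation(byte):
    """Continuation bytes have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80


def invalid_prefix_length(data):
    """Count the start byte plus the valid continuation bytes that follow it,
    stopping at the first byte that is not a continuation byte."""
    announced = expected_length(data[0])
    consumed = 1
    while (consumed < announced and consumed < len(data)
           and is_continuation(data[consumed])):
        consumed += 1
    return consumed


data = bytearray(b"\xef\xbc</name")   # bytearray yields ints on Python 2 and 3
skip = invalid_prefix_length(data)    # bytes a corrected decoder would drop
remaining = bytes(data[skip:])        # what is left after dropping them
```

Here 0xef announces three bytes, 0xbc is a valid continuation byte, and 0x3c is not, so only the first two bytes are dropped and the '<' survives.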