lxml cannot parse XML (whether or not the input is encoded as UTF-8) [python]

My code is:

    import re
    import requests
    from lxml import etree

    url = 'http://weixin.sogou.com/gzhjs?openid=oIWsFt__d2wSBKMfQtkFfeVq_u8I&ext=2JjmXOu9jMsFW8Sh4E_XmC0DOkcPpGX18Zm8qPG7F0L5ffrupfFtkDqSOm47Bv9U'
    r = requests.get(url)
    items = r.json()['items']

  • Without encoding to UTF-8:

    etree.fromstring(items[0])

Output:

    ValueError                                Traceback (most recent call last)
    <ipython-input-69-cb8697498318> in <module>()
    ----> 1 etree.fromstring(items[0])

    lxml.etree.pyx in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)()

    parser.pxi in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102435)()

    ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

  • With .encode('utf-8'):

    etree.fromstring(items[0].encode('utf-8'))

Output:

    File "<string>", line unknown
    XMLSyntaxError: CData section not finished
    鎢ζ₯€ε•ΊιŽΆγˆ€ζ«“ιŽΉι”‹ε§€:闃冲寳I绾挎, line 1, column 281

I don't know how to parse this XML.

1 answer

As a workaround, you can remove the encoding declaration before passing the string to etree.fromstring:

    xml = re.sub(r'\bencoding="[-\w]+"', '', items[0], count=1)
    root = etree.fromstring(xml)
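
To make the workaround easier to try without the live URL, here is a minimal sketch that runs on a made-up document. The `sample` string and the helper name `parse_without_declared_encoding` are assumptions for illustration only; they stand in for `items[0]` and are not taken from the real response.

```python
import re
from lxml import etree

# Hypothetical stand-in for items[0]: a Python str whose XML declaration
# names an encoding (the real response is not reproduced here).
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<DOCUMENT><title><![CDATA[标题]]></title></DOCUMENT>')


def parse_without_declared_encoding(text):
    """Drop the first encoding="..." from the declaration, then parse the str."""
    cleaned = re.sub(r'\bencoding="[-\w]+"', '', text, count=1)
    return etree.fromstring(cleaned)


# Parsing the str as-is reproduces the ValueError from the question.
try:
    etree.fromstring(sample)
except ValueError as exc:
    print('rejected:', exc)

root = parse_without_declared_encoding(sample)
print(root.tag, root.findtext('title'))
```

Note that the pattern only matches a double-quoted encoding value; a declaration written with single quotes would need a slightly broader pattern.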

UPDATE after seeing @Lea's comment on the question:

Specify a parser with explicit encoding:

    # 'items' is a list, so encode its first element rather than the list itself
    xml = r.json()['items'][0].encode('utf-8')
    root = etree.fromstring(xml, parser=etree.XMLParser(encoding='utf-8'))
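
The same idea can be sketched on a small made-up document. The `sample` string and the helper name `parse_as_utf8` are assumptions for illustration, not part of the original data. The point is that the bytes handed to lxml and the parser's `encoding` argument agree with each other, even though the declaration inside the document names a different encoding.

```python
from lxml import etree

# Hypothetical stand-in for items[0]: the declaration claims GBK, but the
# value is already a decoded Python str.
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<DOCUMENT><title><![CDATA[标题]]></title></DOCUMENT>')


def parse_as_utf8(text):
    """Encode the str as UTF-8 and tell the parser to read it as UTF-8."""
    data = text.encode('utf-8')
    parser = etree.XMLParser(encoding='utf-8')  # overrides the declared encoding
    return etree.fromstring(data, parser=parser)


root = parse_as_utf8(sample)
print(root.tag, root.findtext('title'))
```

Unlike the regex workaround, this leaves the document text untouched, which may be preferable when the declaration is formatted in a way a simple pattern would miss.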

Source: https://habr.com/ru/post/1237389/

