Garbled Unicode characters with lxml

I am new to lxml and Python. I am trying to parse an HTML document. When I parse it with the standard XML parser, the characters are written out correctly, but I think the document is not parsed properly, because I have trouble searching it with XPath.

Example file being processed:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>title</title>
    </head>
    <body>
    <span id="demo">Garbléd charactérs</span>
    </body>
    </html>

Parsing code:

    from lxml import etree

    fname = 'output/so-help.html'

    # parse
    hparser = etree.HTMLParser()
    htree = etree.parse(fname, hparser)

    # garbled
    htree.write('so-dumpu.html', encoding='utf-8')

    # targets
    demo_name = htree.xpath("//span[@id='demo']")

    # garbled
    print 'name: "' + demo_name[0].text

Terminal output:

 name: "Garbléd charactérs 

htree.write output:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>title</title></head><body>
    <span id="demo">Garbléd charactérs</span>
    </body></html>
3 answers

The problem is that you are encoding data that has already been encoded; you need the parser to decode the data as UTF-8. To see this, try `demo_name[0].text.decode('utf-8')` in your original code.

The correct way to do this:

    from lxml import etree

    fname = 'output/so-help.html'

    # parse
    hparser = etree.HTMLParser(encoding='utf-8')
    htree = etree.parse(fname, hparser)

    # garbled
    htree.write('so-dumpu.html')

    # targets
    demo_name = htree.xpath("//span[@id='demo']")

    # garbled
    print 'name: "' + demo_name[0].text
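For readers on Python 3, the same idea can be sketched without a file on disk. This is a minimal illustration rather than the answerer's code: the inline `html_bytes` document and the variable names are made up for the example, and it assumes lxml is installed. What the parser does when no encoding is given may depend on the underlying libxml2 version, so only the explicit case is shown.

```python
from lxml import etree

# Hypothetical stand-in for the file in the question: UTF-8 bytes with no
# charset declaration anywhere in the markup.
html_bytes = (
    '<html><head><title>title</title></head>'
    '<body><span id="demo">Garbléd charactérs</span></body></html>'
).encode('utf-8')

# Tell the parser how the bytes are encoded instead of letting it guess.
utf8_parser = etree.HTMLParser(encoding='utf-8')
root = etree.fromstring(html_bytes, utf8_parser)

demo_text = root.xpath("//span[@id='demo']")[0].text
print('name: "' + demo_text + '"')
```

In Python 3 the `.text` attribute is already a `str`, so no further `decode` or `encode` call should be needed before printing.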

Try changing the output encoding:

 htree.write('so-dumpu.html', encoding='latin1') 

and

 print 'name: "' + demo_name[0].text.encode('latin1') 
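Why re-encoding as Latin-1 can help is visible without lxml at all. The sketch below (Python 3, standard library only, illustrative variable names) assumes the garbling comes from UTF-8 bytes having been decoded as Latin-1. Under that assumption, encoding the garbled text back to Latin-1 recovers the original bytes.

```python
original = 'Garbléd charactérs'

# The file on disk holds UTF-8 bytes.
utf8_bytes = original.encode('utf-8')

# Decoding those bytes with the wrong codec produces the garbled text:
# each two-byte UTF-8 sequence turns into two Latin-1 characters.
garbled = utf8_bytes.decode('latin-1')

# Latin-1 maps every byte to exactly one character, so the mistake is
# reversible: encoding back to Latin-1 yields the original bytes.
recovered_bytes = garbled.encode('latin-1')
repaired = recovered_bytes.decode('utf-8')

print(repr(garbled))
print(repaired)
```

This is a workaround rather than a fix: it undoes the wrong decoding after the fact, and it only applies when the wrong codec really was Latin-1.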

I assume your XHTML document is encoded in utf-8. The problem is that the encoding is not specified in the HTML document. By default, browsers and lxml.html assume that HTML documents are encoded in ISO-8859-1, so your document is parsed incorrectly. If you open it in your browser, it will also not display correctly.

You can specify the encoding of your document as follows:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>title</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    </head>
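The effect of declaring the encoding in the document can be checked with a small Python 3 sketch. It assumes lxml is installed and uses a made-up inline document (`html_bytes`) in place of the real file; the parser is created with no `encoding` argument, so it has to rely on the `meta` tag.

```python
from lxml import etree

# Hypothetical UTF-8 document that declares its own charset in <head>.
html_bytes = (
    '<html><head><title>title</title>'
    '<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
    '</head><body><span id="demo">Garbléd charactérs</span></body></html>'
).encode('utf-8')

# Default HTML parser: no encoding is forced from the Python side.
root = etree.fromstring(html_bytes, etree.HTMLParser())

declared_text = root.xpath("//span[@id='demo']")[0].text
print(declared_text)
```

The declaration has to appear before the first non-ASCII character for the parser to be able to act on it, which is why it belongs in `<head>`.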

Alternatively, you can force the encoding used by lxml (just as you can change the encoding used in your browser):

    import lxml.html

    # read the raw bytes and decode them as UTF-8 before handing them to lxml
    with open(fname) as f:
        filecontents = f.read().decode("utf-8")

    htree = lxml.html.fromstring(filecontents)
    print htree.xpath("//span[@id='demo']")[0].text
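A Python 3 version of this decode-first approach might look like the following. It is a sketch under a few assumptions: lxml is installed, the bytes really are UTF-8, and the inline `raw_bytes` document stands in for the contents of the file.

```python
import lxml.html

# Hypothetical stand-in for the bytes read from the file.
raw_bytes = (
    '<html><head><title>title</title></head>'
    '<body><span id="demo">Garbléd charactérs</span></body></html>'
).encode('utf-8')

# Decode explicitly, so lxml receives text and never has to guess a charset.
text = raw_bytes.decode('utf-8')
root = lxml.html.fromstring(text)

decoded_text = root.xpath("//span[@id='demo']")[0].text
print(decoded_text)
```

With a real file, opening it as `open(fname, encoding='utf-8')` and passing the result of `read()` to `lxml.html.fromstring` should have the same effect.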

Source: https://habr.com/ru/post/1399885/