Unicode lxml characters

Question

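I am new to lxml and Python. I am trying to parse an HTML document. When I parse it with the standard XML parser, the characters are written out correctly, but I don't think the document is parsed properly, because I have trouble searching it with XPath. When I parse it with the HTML parser instead, XPath works but the accented characters come out garbled.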

Example file processed:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>title</title>
    </head>
    <body>
    <span id="demo">Garbléd charactérs</span>
    </body>
    </html>

Parsing code:

    from lxml import etree

    fname = 'output/so-help.html'

    # parse
    hparser = etree.HTMLParser()
    htree = etree.parse(fname, hparser)

    # garbled
    htree.write('so-dumpu.html', encoding='utf-8')

    # targets
    demo_name = htree.xpath("//span[@id='demo']")

    # garbled
    print 'name: "' + demo_name[0].text

Terminal output:

 name: "GarblÃ©d charactÃ©rs

htree.write output:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>title</title></head><body> <span id="demo">GarblÃ©d charactÃ©rs</span> </body></html>

+4

python encoding lxml

ryan Mar 05 '12 at 10:05

3 answers

Try changing the output encoding:

 htree.write('so-dumpu.html', encoding='latin1')

and

 print 'name: "' + demo_name[0].text.encode('latin1')

+3

dusan Mar 05 '12 at 22:12

I assume your XHTML document is encoded in utf-8. The problem is that the encoding is not specified in the HTML document. By default, browsers and lxml.html assume that HTML documents are encoded in ISO-8859-1, so your document is parsed incorrectly. If you open it in your browser, it will also not display correctly.
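
The garbling described above can be reproduced without lxml at all. The following is a minimal Python 3 sketch (the snippets in this thread use Python 2 syntax), assuming the file really is UTF-8 and is being read as ISO-8859-1 (Latin-1). The variable names are illustrative only.

```python
# The text as it appears in the source file.
original = 'Garbléd charactérs'

# What is actually stored on disk if the file is saved as UTF-8.
utf8_bytes = original.encode('utf-8')

# Decoding those bytes as Latin-1 turns each two-byte UTF-8 sequence
# into two separate characters, which is the garbling seen in the question.
garbled = utf8_bytes.decode('latin-1')
print(garbled)  # GarblÃ©d charactÃ©rs

# Re-encoding as Latin-1 restores the original bytes, which is why
# writing the output with encoding='latin1' appears to fix the problem.
recovered = garbled.encode('latin-1').decode('utf-8')
print(recovered)  # Garbléd charactérs
```

This also suggests why the latin1 suggestion in the other answer helps: it undoes the wrong decoding rather than preventing it.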

You can specify the encoding of your document as follows:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>title</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    </head>

Alternatively, you can force the encoding used by lxml (just as you can change the encoding used in your browser) by decoding the file contents yourself before parsing:

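    import lxml.html

    # fname is the path from the question, e.g. 'output/so-help.html'
    f = open(fname)
    filecontents = f.read()
    f.close()

    # decode the raw bytes as UTF-8 before handing them to lxml
    filecontents = filecontents.decode("utf-8")
    htree = lxml.html.fromstring(filecontents)
    print htree.xpath("//span[@id='demo']")[0].text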
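
For readers on Python 3, the same decode-first idea might look like the sketch below. It is not part of the original answer: the temporary file stands in for output/so-help.html, and the shortened HTML document is a made-up sample. In Python 3, open() with encoding='utf-8' already returns decoded text, so no separate .decode() call is needed.

```python
import os
import tempfile

import lxml.html

# Hypothetical sample document: UTF-8 content with no charset declared.
html = (
    '<html><head><title>title</title></head>'
    '<body><span id="demo">Garbléd charactérs</span></body></html>'
)

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, 'so-help.html')

    # Write the sample as UTF-8 bytes, as the original file is assumed to be.
    with open(fname, 'wb') as f:
        f.write(html.encode('utf-8'))

    # Read it back as text, decoding explicitly as UTF-8.
    with open(fname, encoding='utf-8') as f:
        filecontents = f.read()

# lxml receives an already-decoded string, so it does not have to guess.
root = lxml.html.fromstring(filecontents)
demo_text = root.xpath("//span[@id='demo']")[0].text
print(demo_text)
```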

+2

Xion345 Mar 05 '12 at 22:26

Emil m · Accepted Answer · 2012-03-05T22:55:32+0000

The problem was that you were trying to encode data that was already encoded; what you need is for the parser to decode the data as UTF-8. (In your original code, try demo_name[0].text.decode('utf-8') and you will see.)

The correct way to do this:

    from lxml import etree

    fname = 'output/so-help.html'

    # parse, telling the parser that the input is UTF-8
    hparser = etree.HTMLParser(encoding='utf-8')
    htree = etree.parse(fname, hparser)

    # previously garbled
    htree.write('so-dumpu.html')

    # targets
    demo_name = htree.xpath("//span[@id='demo']")

    # previously garbled
    print 'name: "' + demo_name[0].text
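
A Python 3 sketch of the same fix, run on in-memory bytes instead of a file so that it is self-contained. The sample document and the helper name span_text are made up for illustration. What the default parser returns for undeclared UTF-8 input may depend on the installed libxml2 version, so only the explicit-encoding result is treated as reliable here.

```python
from lxml import etree

# Hypothetical stand-in for the file: UTF-8 bytes with no charset declared.
html_bytes = (
    '<html><head><title>title</title></head>'
    '<body><span id="demo">Garbléd charactérs</span></body></html>'
).encode('utf-8')


def span_text(data, parser):
    """Parse raw HTML bytes and return the text of the span with id 'demo'."""
    root = etree.fromstring(data, parser)
    return root.xpath("//span[@id='demo']")[0].text


# Default parser: has to guess the encoding, which is where the question's
# garbled output came from.
default_text = span_text(html_bytes, etree.HTMLParser())

# Parser told explicitly that the input is UTF-8.
utf8_text = span_text(html_bytes, etree.HTMLParser(encoding='utf-8'))

print(repr(default_text))
print(repr(utf8_text))
```

Passing encoding='utf-8' to the parser fixes the problem at the decoding step, so no later encode or decode calls on the extracted text are needed.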
