I assume your XHTML document is encoded in UTF-8. The problem is that the document does not declare its encoding. When no encoding is declared, browsers and lxml.html fall back to ISO-8859-1 (or a closely related legacy encoding), so the UTF-8 bytes are decoded incorrectly. If you open the file in your browser, it will not display correctly either.
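To see why the missing declaration matters, here is a minimal sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. The sample character is only an illustration and is not taken from your document:

```python
# -*- coding: utf-8 -*-
# Illustrative sample text (an assumption, not from the original document).
original = u"\u00e9"  # the character "é"

# This is how the character is stored in a UTF-8 encoded file: two bytes.
raw = original.encode("utf-8")

# Decoding those bytes with the wrong (fallback) encoding yields two
# unrelated characters instead of one.
misread = raw.decode("iso-8859-1")

# Decoding with the correct encoding recovers the original character.
correct = raw.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

Each multi-byte UTF-8 character turns into two or more Latin-1 characters, which is the garbled text you typically see in this situation.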
You can specify the encoding of your document as follows:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>title</title>
        <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    </head>
Alternatively, you can force the encoding used by lxml, just as you can override the encoding in your browser:
    import lxml.html

    # fname is the path to your XHTML file.
    # Read raw bytes, then decode them explicitly as UTF-8.
    with open(fname, "rb") as f:
        filecontents = f.read().decode("utf-8")

    htree = lxml.html.fromstring(filecontents)
    print(htree.xpath("//span[@id='demo']")[0].text)
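Another option, sketched here under the assumption that you would rather let lxml do the decoding, is to pass the raw bytes together with a parser that has its encoding set explicitly. The helper name `parse_utf8_html` and the sample markup are hypothetical and only serve the illustration:

```python
import lxml.html


def parse_utf8_html(raw_bytes):
    """Parse HTML bytes, telling lxml to treat them as UTF-8 (hypothetical helper)."""
    # HTMLParser accepts an encoding override that takes precedence over the default.
    parser = lxml.html.HTMLParser(encoding="utf-8")
    return lxml.html.fromstring(raw_bytes, parser=parser)


# Illustrative document with no encoding declaration (an assumption for the demo).
sample = u'<html><body><span id="demo">caf\u00e9</span></body></html>'.encode("utf-8")

tree = parse_utf8_html(sample)
demo_text = tree.xpath("//span[@id='demo']")[0].text
print(demo_text)
```

This keeps the file as bytes until it reaches the parser, so there is no separate decode step to forget.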