Why does this XPath not work using lxml in Python?

Here is an example of the webpage I'm trying to get data from. http://www.makospearguns.com/product-p/mcffgb.htm

The XPath was taken from the Chrome developer tools, and FirePath in Firefox can also find it, but with lxml it just returns an empty list for text.

    from lxml import html
    import requests

    site_url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
    xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

    page = requests.get(site_url)
    tree = html.fromstring(page.text)
    text = tree.xpath(xpath)

Printing the tree text with

 print(tree.text_content().encode('utf-8')) 

shows that the data is there, but the XPath doesn't find it. Is there something I'm missing? Most of the other sites I have tried work fine with lxml and an XPath taken from the Chrome dev tools, but a few give empty lists like this.

3 answers

1. Browsers often change HTML

Browsers often change the HTML that they serve to make it "valid". For example, if you serve a browser this invalid HTML:

    <table>
    <p>bad paragraph</p>
    <tr><td>Note that cells and rows can be unclosed (and valid) in HTML
    </table>

the browser will helpfully try to make it valid HTML, and may convert it to:

    <p>bad paragraph</p>
    <table>
      <tbody>
        <tr>
          <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
        </tr>
      </tbody>
    </table>

This happens because <p> cannot be a child of <table>, and the <tbody> wrapper is implied by the HTML spec. How the changes are applied to the source can vary a lot from browser to browser: some will put invalid elements in front of tables, some after, some inside cells, and so on.
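A quick way to see this from Python (a minimal sketch, using the invalid snippet above): lxml's parser performs no such repair, so parsing it inserts no tbody, and a browser-derived XPath that relies on one can never match.

    # Parse the invalid snippet with lxml and inspect the repaired tree;
    # unlike a browser, libxml2 does not insert any <tbody> elements.
    from lxml import html

    snippet = '<table> <p>bad paragraph</p> <tr><td>unclosed cell </table>'
    tree = html.fromstring(snippet)
    print(html.tostring(tree, pretty_print=True).decode())
    print(b'<tbody>' in html.tostring(tree))  # False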

2. XPaths aren't fixed; they are flexible in pointing at elements

Using this “fixed” HTML:

    <p>bad paragraph</p>
    <table>
      <tbody>
        <tr>
          <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
        </tr>
      </tbody>
    </table>

If we want to target the text of the <td> cell, all of the following XPaths will get you the right information:

    //td
    //tr/td
    //tbody/tr/td
    /table/tbody/tr/td
    /table//*/text()

And this list goes on ...

However, a browser will generally give you the most precise (and least flexible) XPath, one that lists every element down from the root of the DOM. In this case:

    /table[1]/tbody[1]/tr[1]/td[1]/text()
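To see the flexibility for yourself, here is a minimal sketch that runs the relative XPaths above through lxml against the "fixed" HTML; all of them land on the same cell:

    # Each of the flexible XPaths resolves to the same <td> text.
    from lxml import html

    fixed = ('<p>bad paragraph</p>'
             '<table><tbody><tr>'
             '<td>Note that cells and rows can be unclosed (and valid) in HTML</td>'
             '</tr></tbody></table>')

    tree = html.fromstring(fixed)
    for xp in ('//td/text()', '//tr/td/text()', '//tbody/tr/td/text()'):
        print(xp, '->', tree.xpath(xp))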

3. Conclusion: browser-provided XPaths are usually useless

This is why the XPaths created by developer tools will often give you the wrong result when run against the raw HTML.

The solution is to always refer to the raw HTML and use a flexible, but precise, XPath.

Examine the actual HTML that contains the price:

 <table border="0" cellspacing="0" cellpadding="0"> <tr> <td> <font class="pricecolor colors_productprice"> <div class="product_productprice"> <b> <font class="text colors_text">Price:</font> <span itemprop="price">$149.95</span> </b> </div> </font> <br/> <input type="image" src="/v/vspfiles/templates/MAKO/images/buttons/btn_updateprice.gif" name="btnupdateprice" alt="Update Price" border="0"/> </td> </tr> </table> 

If you want the price, there is really only one place to look!

 //span[@itemprop="price"]/text() 

And this will return:

 $149.95 
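For completeness, a minimal sketch of the question's own script with this flexible XPath swapped in (same URL and libraries as the question):

    # Same fetch as the question, but keyed on the itemprop attribute
    # instead of the browser-generated chain of tbody elements.
    import requests
    from lxml import html

    site_url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
    page = requests.get(site_url)
    tree = html.fromstring(page.text)
    print(tree.xpath('//span[@itemprop="price"]/text()'))  # ['$149.95']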

The XPath is simply incorrect

Here is a snippet from the page:

 <form id="vCSS_mainform" method="post" name="MainForm" action="/ProductDetails.asp?ProductCode=MCFFGB" onsubmit="javascript:return QtyEnabledAddToCart_SuppressFormIE();"> <img src="/v/vspfiles/templates/MAKO/images/clear1x1.gif" width="5" height="5" alt="" /><br /> <table width="100%" cellpadding="0" cellspacing="0" border="0" id="v65-product-parent"> <tr> <td colspan="2" class="vCSS_breadcrumb_td"><b> &nbsp; <a href="http://www.makospearguns.com/">Home</a> > 

You can see that the element with id "v65-product-parent" is a table and has tr subelements.

There can be only one element with that id (otherwise the markup would be invalid).

The XPath expects a tbody to be a child of that element (the table), but there is no tbody anywhere on the page.

This can be checked with:

    >>> "tbody" in page.text
    False

How did Chrome come to this XPath?

If you just load this page

 $ wget http://www.makospearguns.com/product-p/mcffgb.htm 

and view its contents, you will see that it does not contain a single element named tbody.

But if you use the Chrome developer tools, you will find some.

How does this happen?

This often happens when JavaScript comes into play and generates some of the page content in the browser. But, as LegoStormtroopr noted, that is not the case here; this time it is the browser itself that modifies the document to make it well-formed.

How do you get the page content as dynamically modified by the browser?

You have to give some kind of browser a chance to run the page. For instance, with selenium:

byselenium.py

    from selenium import webdriver
    from lxml import html

    url = "http://www.makospearguns.com/product-p/mcffgb.htm"
    xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

    browser = webdriver.Firefox()
    browser.get(url)
    html_source = browser.page_source
    print "test tbody", "tbody" in html_source

    tree = html.fromstring(html_source)
    text = tree.xpath(xpath)
    print text

which prints:

    $ python byselenium.py
    test tbody True
    ['$149.95']

Conclusion

Selenium is great when it comes to changes made by the browser. However, it is a somewhat heavyweight tool, and if you can get away with something lighter, do so. LegoStormtroopr's answer above is exactly such a simpler solution, working on the plainly downloaded page.


I had a similar problem (Chrome inserting tbody elements when you do "Copy as XPath"). As the other answers said, you have to look at the actual page source, though the browser-provided XPath is a good place to start. I found that removing the tbody tags often fixes it, and to help test this I wrote a small Python utility script to check XPaths:

    #!/usr/bin/env python
    import sys, requests
    from lxml import html

    if len(sys.argv) < 3:
        print 'Usage: ' + sys.argv[0] + ' url xpath'
        sys.exit(1)
    else:
        url = sys.argv[1]
        xp = sys.argv[2]

    page = requests.get(url)
    tree = html.fromstring(page.text)
    nodes = tree.xpath(xp)

    if len(nodes) == 0:
        print 'XPath did not match any nodes'
    else:
        # tree.xpath(xp) produces a list, so always just take the first item
        print (nodes[0]).text_content().encode('ascii', 'ignore')

(That's Python 2.7, in case the statement form of print didn't give it away.)
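For example, saved as xptest.py (the filename here is only illustrative), it can verify the price XPath from the first answer. Note that the script calls text_content() on the first match, so pass an element path rather than a .../text() path. It should print something like:

    $ python xptest.py http://www.makospearguns.com/product-p/mcffgb.htm '//span[@itemprop="price"]'
    $149.95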

