How to get HTML attributes in nested tags using Mechanize in Python?

Question

Hi everyone. I'm having trouble getting links out of nested HTML using Mechanize in Python. Here is my current code (I have tried many variations; this is just the latest one that doesn't work) (and please forgive my variable names (thing, stuff)):

    soup = BeautifulSoup(resultsPage)
    if not soup.find(attrs={'class' : 'paging'}):
        print "Only one producted listed!"
    else:
        stuff = soup.find('div', attrs={'class' : 'paging'}).ul.li
        for thing in stuff:
            print thing

Here is the HTML I'm looking at:

    <div class="paging">
      <ul>
        <li>< </li>
        <li class='on'> 1-10 </li>
        <li class=''> <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl01_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&amp;brandid=22&amp;searchtext=jell-o&amp;pageno=2">11-20</a> </li>
        <li class=''> <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl02_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&amp;brandid=22&amp;searchtext=jell-o&amp;pageno=3">21-30</a> </li>
        <li class=''> <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl03_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&amp;brandid=22&amp;searchtext=jell-o&amp;pageno=4">31-40</a> </li>
        <li class=''> <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl04_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&amp;brandid=22&amp;searchtext=jell-o&amp;pageno=5">41-50</a> </li>
        <li class=''> <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl05_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&amp;brandid=22&amp;searchtext=jell-o&amp;pageno=6">51-60</a> </li>
        <li> <a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_lnkNext" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&amp;brandid=22&amp;searchtext=jell-o&amp;pageno=7">>></a> </li>
      </ul>

I need to determine whether there are <li> tags with hyperlinks in them; if there are, I need to save those links so I can follow them later. This is the page the HTML came from, if you are interested: http://www.kraftrecipes.com/Products/ProductInfoSearchResults.aspx?CatalogType=1&BrandId=22&SearchText=Jell-O&PageNo=1

I'm working on something to scrape product information from websites, and I need to be able to navigate the search results.

I have another quick question. Is it good practice to chain tag lookups and find queries like this?

 ingredients = soup.find(attrs={'class' : "TitleAndDescription"}).div.find(text=re.compile("Ingredients")).next

I am just learning Python, but this seems kind of kludge-y, and I would like to know what you guys think. Here is an example of the HTML that I scraped:

    <table>
      <tr>
        <td>
          <div id="contHeader" class="TitleAndDescription">
            <h1>JELL-O - GELATIN DESSERT - RASPBERRY</h1>
            <div class="textArea">
              <strong>Ingredients:</strong> SUGAR, GELATIN, ADIPIC ACID (FOR TARTNESS), CONTAINS LESS THAN 2% OF ARTIFICIAL FLAVOR, DISODIUM PHOSPHATE AND SODIUM CITRATE (CONTROL ACIDITY), FUMARIC ACID (FOR TARTNESS), RED 40.<br/>
              <strong>Size:</strong> 6 OZ<br/><strong>Upc:</strong> 4300020052<br/>
              <br/> <!--<br/>--> <br/>
            </div>
          </div>
          ...
        </td>
        ...
      </tr>
      ...
    </table>

Sorry for the wall of text. Let me know if you need more information.

Thanks.

+4

python html beautifulsoup

user1074600 Dec 6 '11 at 5:05


2 answers

Ameet · Answer 1 · 2012-05-08T06:23:34+0000

Python's HTMLParser module may be one way to solve this problem. See http://docs.python.org/library/htmlparser.html for details.
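As a rough illustration of this suggestion, here is a minimal sketch that collects the href of every a tag nested inside an li tag. It assumes Python 3, where the module lives at html.parser (in Python 2 it was HTMLParser). The class name PagingLinkParser, the sample_html string, and the example.com URLs are made up for illustration and are not from the question; the real page would be fed in the same way.

```python
from html.parser import HTMLParser


class PagingLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside an <li> tag."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0   # number of <li> elements currently open
        self.links = []     # href values found while inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


# Shortened, hypothetical stand-in for the paging markup in the question.
sample_html = """
<div class="paging"><ul>
  <li class='on'> 1-10 </li>
  <li class=''><a id="p2" href="http://example.com/results?brandid=22&amp;pageno=2">11-20</a></li>
  <li><a id="next" href="http://example.com/results?brandid=22&amp;pageno=3">next</a></li>
</ul></div>
<a href="http://example.com/outside">not in a list item</a>
"""

parser = PagingLinkParser()
parser.feed(sample_html)
parser.close()
print(parser.links)
```

The parser unescapes attribute values, so the collected URLs contain a plain & rather than &amp;. This sketch only tracks li nesting; restricting it to the div with class "paging" would need an extra flag set in handle_starttag.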

jcollado · Answer 2 · 2011-12-12T13:30:34+0000

If I understand correctly, what you want is a list of all li tags that contain an a tag (regardless of how deep it is in the DOM tree). If that is correct, you can do something like this:

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(resultsPage)
    list_items = [list_item for list_item in soup.findAll('li')
                  if list_item.findAll('a')]
