How to parse the following HTML in python beautifulsoup?

Question

Suppose the following is a subset of an HTML document. Note that there are multiple tables like this one; the value in <a name="1"> may be "2", "3", "4", etc., with different text in each table.

 <table align="center" width="550"> <tr> <td valign="top" width="300"><b>Product:</b></img></td> <td> <a name="1"></a>1) Text Editor <p>An application for the editing of text files.</p> <br> <b>Application Name: Notepad</b> <br> <b>Type: Writing</b> <br><br></td> </tr> </table>

I want to find the "a" tag whose name equals a certain number (in this case 1) and then somehow get the text "1) Text Editor".

I know that once I have parsed the whole document with BeautifulSoup, I could use something like findAll("table") to get all the tables, but I don't know how to get to this value. I could do something like findAll("a"), but how would I specify a "name" equal to 1? Even if I could do that, I would not be able to get to the "1) Text Editor" text, because the "a" tag is empty. I also could not get to things like "<b>Application Name: Notepad</b>".

What is the best way to do this with Python and BeautifulSoup, or is there a better way to get the "1) Text Editor", "Application Name" and "Type" values from the table, given that they follow <a name="1"></a>? Example syntax would be great.

python html beautifulsoup

Setsuna Oct 28 '12 at 19:15

2 answers

It looks like you can pass in an attrs dictionary to match against. This will match on the name attribute.

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#arg-attrs

 soup.findAll(attrs={'name' : '1'})

In case you haven't seen it already, the documentation contains many really great examples of how to find elements in an HTML document.
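One thing worth knowing about the attrs-only call above: without a tag name it matches any element that carries name="1", not just anchors. Here is a minimal sketch of that, assuming Beautiful Soup 4 is installed (imported as bs4, with find_all as the current spelling of findAll). The HTML fragment is made up for illustration and is not from the question.

```python
from bs4 import BeautifulSoup

# Made-up fragment (an assumption, not the asker's page):
# two different tags share name="1".
html = '<form><input name="1"></form><p><a name="1"></a>1) Text Editor</p>'
soup = BeautifulSoup(html, "html.parser")

# attrs alone matches every tag whose name attribute is "1".
any_tag = soup.find_all(attrs={"name": "1"})

# Passing the tag name as well narrows the match to <a> elements.
only_anchors = soup.find_all("a", attrs={"name": "1"})

print([tag.name for tag in any_tag])
print([tag.name for tag in only_anchors])
```

If the page only ever uses name on anchors, both calls give the same result, so the shorter form is fine there.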

dm03514 Oct 28 '12 at 19:25

Zero piraeus · Accepted Answer · 2012-10-28T19:34:36+0000

You can specify attributes with findAll ...

 >>> a = soup.findAll("a", attrs={"name": "1"})[0]

... and then get the next node ...

 >>> a.next
 u'1) Text Editor\n'

... and the next <b> element ...

 >>> a.findNext("b")
 <b>Application Name: Notepad</b>

... etc.
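Putting those steps together, here is a rough sketch that pulls the title and both bold fields for a given anchor number. It assumes Beautiful Soup 4 (bs4) rather than the BS3 API used above, so it uses next_sibling and find_next_siblings. The helper name describe_product is hypothetical, the HTML is the question's snippet with the stray </img> dropped, and the code assumes the node right after the anchor is plain text, as it is in that snippet.

```python
from bs4 import BeautifulSoup

# The question's table, minus the stray </img>.
HTML = (
    '<table align="center" width="550"> <tr> '
    '<td valign="top" width="300"><b>Product:</b></td> '
    '<td> <a name="1"></a>1) Text Editor '
    '<p>An application for the editing of text files.</p> <br> '
    '<b>Application Name: Notepad</b> <br> '
    '<b>Type: Writing</b> <br><br></td> </tr> </table>'
)


def describe_product(soup, number):
    """Hypothetical helper: collect the text that follows <a name="number">."""
    anchor = soup.find("a", attrs={"name": str(number)})
    if anchor is None:
        return None
    # Assumption: the anchor is immediately followed by a text node.
    details = {"title": anchor.next_sibling.strip()}
    # Only <b> siblings after the anchor, i.e. inside the same <td>.
    for bold in anchor.find_next_siblings("b"):
        key, _, value = bold.get_text().partition(":")
        details[key.strip()] = value.strip()
    return details


soup = BeautifulSoup(HTML, "html.parser")
product = describe_product(soup, 1)
print(product)
```

find_next_siblings is used here in place of repeated findNext("b") calls so that the search cannot run on into the next table; that is a choice made for this sketch, not something the answer prescribes.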

By the way, the attrs argument is only needed here because name is already the name of findAll()'s first parameter (the tag name to search for). Any other attribute can be passed directly as a keyword argument, for example:

 >>> a = soup.findAll("a", href="whatever")
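To see why name needs the attrs dictionary while href does not, here is a small sketch, again assuming Beautiful Soup 4 and a made-up two-link fragment. Passing name="1" as a keyword collides with the first positional parameter, which Python reports as a TypeError before Beautiful Soup does any searching.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
html = '<a name="1"></a><a href="whatever">link</a>'
soup = BeautifulSoup(html, "html.parser")

# href is an ordinary attribute, so a keyword argument works.
by_href = soup.find_all("a", href="whatever")

# name has to go through attrs ...
by_name = soup.find_all("a", attrs={"name": "1"})

# ... because name= would be a second value for the first parameter.
try:
    soup.find_all("a", name="1")
    clash = None
except TypeError as exc:
    clash = exc

print(len(by_href), len(by_name), type(clash).__name__)
```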
