Extract table contents from html using python and BeautifulSoup

Question

I want to extract certain information from an HTML document. For instance, it contains a table (among other tables with different contents):

    <table class="details">
      <tr>
        <th>Advisory:</th>
        <td>RHBA-2013:0947-1</td>
      </tr>
      <tr>
        <th>Type:</th>
        <td>Bug Fix Advisory</td>
      </tr>
      <tr>
        <th>Severity:</th>
        <td>N/A</td>
      </tr>
      <tr>
        <th>Issued on:</th>
        <td>2013-06-13</td>
      </tr>
      <tr>
        <th>Last updated on:</th>
        <td>2013-06-13</td>
      </tr>
      <tr>
        <th valign="top">Affected Products:</th>
        <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
      </tr>
    </table>

I want to extract information such as the date next to "Issued on:". It looks like BeautifulSoup4 can do this easily, but somehow I can't get it to work. My code is:

    from bs4 import BeautifulSoup
    soup=BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
    table_tag=soup.table
    if table_tag['class'] == ['details']:
        print table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text()
        a=table_tag.next_sibling
        print unicode(a)
        print table_tag.contents

This gets me the contents of the first row of the table, as well as a listing of the table's contents. But next_sibling is not working as expected; I assume I am just using it incorrectly. Of course, I could parse the contents by hand, but it seems to me that BeautifulSoup was designed to spare us exactly that (if I start parsing by hand, I might as well parse the whole document myself ...). If someone could enlighten me on how to accomplish this, I would appreciate it. If there is a better way than BeautifulSoup, I would be interested to hear about it.

python screen-scraping beautifulsoup

Isaac Jun 19 '13 at 16:04

1 answer

falsetru · Accepted Answer · 2013-06-19T16:43:55+0000

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
    >>> table = soup.find('table', {'class': 'details'})
    >>> th = table.find('th', text='Issued on:')
    >>> th
    <th>Issued on:</th>
    >>> td = th.findNext('td')
    >>> td
    <td>2013-06-13</td>
    >>> td.text
    u'2013-06-13'
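
For anyone on Python 3, here is a minimal self-contained sketch of the same approach. It assumes a reasonably recent bs4 release, where `string=` and `find_next` are the current spellings of `text=` and `findNext`. The trimmed HTML, the extra `other` table, and the `table_to_dict` helper are illustrative additions, not part of the question. It also hints at why `.next_sibling` probably looked broken: the sibling of a tag is often the whitespace text between tags rather than the next tag.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, preceded by an unrelated
# table (hypothetical) to mimic "among other tables".
HTML = """
<table class="other"><tr><th>Issued on:</th><td>not this one</td></tr></table>
<table class="details">
  <tr> <th>Advisory:</th> <td>RHBA-2013:0947-1</td> </tr>
  <tr> <th>Issued on:</th> <td>2013-06-13</td> </tr>
  <tr> <th>Last updated on:</th> <td>2013-06-13</td> </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# class_ restricts the search to the table with class="details".
table = soup.find("table", class_="details")

# Look up a single value by its header text.
th = table.find("th", string="Issued on:")
issued_on = th.find_next("td").get_text(strip=True)


def table_to_dict(table_tag):
    """Collect every <th>/<td> pair of a table into a dict (illustrative helper)."""
    result = {}
    for row in table_tag.find_all("tr"):
        header, cell = row.find("th"), row.find("td")
        if header is not None and cell is not None:
            result[header.get_text(strip=True)] = cell.get_text(strip=True)
    return result


details = table_to_dict(table)

# .next_sibling returns the very next node, which here is whitespace text.
first_row = table.find("tr")
raw_sibling = first_row.next_sibling
# find_next_sibling skips text nodes and returns the next matching tag.
next_row = first_row.find_next_sibling("tr")

print(issued_on)
print(details)
print(repr(raw_sibling))
print(next_row.th.get_text())
```

If the whole table is needed rather than one field, the dict form is usually easier to work with than walking siblings by hand.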
