Selecting special <tr> tags with BeautifulSoup
I am extracting some rows from an HTML table with BeautifulSoup, using this piece of code:

    from bs4 import BeautifulSoup
    import urllib2
    import re

    page = urllib2.urlopen('www.something.bla')
    soup = BeautifulSoup(page)
    rows = soup.findAll('tr', attrs={'class': re.compile('class1.*')})

Here is what I get as a result:

    <tr class="class1 class2 class3">...</tr>
    <tr class="class1 class2 class3">...</tr>
    <tr class="class1 class5">...</tr>
    <tr class="class1_a class5_a">...</tr>
    <tr class="class1 class5">...</tr>
    <tr class="class1_a class5_a">...</tr>
    <!-- etc. -->

However, I would like to exclude (or, better, not select in the first place) the rows whose class attribute is class1 class2 class3.
How can I do this?
Thanks for the help!
1 answer
Perhaps this is easier without regex. This works with BeautifulSoup 3:

    from BeautifulSoup import BeautifulSoup

    page = """
    <tr class="class1 class2 class3">1</tr>
    <tr class="class1 class2 class3">2</tr>
    <tr class="class1 class5">3</tr>
    <tr class="class1_a class5_a">4</tr>
    <tr class="class1 class5">5</tr>
    <tr class="class1_a class5_a">6</tr>
    <tr>7</tr>"""

    def cond(x):
        # x is the raw class string; the guard covers tags without a class
        if x:
            return x.startswith("class1") and "class2 class3" not in x
        else:
            return False

    soup = BeautifulSoup(page)
    rows = soup.findAll('tr', {'class': cond})
    for row in rows:
        print row

=>

    <tr class="class1 class5">3</tr>
    <tr class="class1_a class5_a">4</tr>
    <tr class="class1 class5">5</tr>
    <tr class="class1_a class5_a">6</tr>

With BeautifulSoup 4, I was able to get it to work as follows:

    import re
    from bs4 import BeautifulSoup

    page = """
    <tr class="class1 class2 class3">1</tr>
    <tr class="class1 class2 class3">2</tr>
    <tr class="class1 class5">3</tr>
    <tr class="class1_a class5_a">4</tr>
    <tr class="class1 class5">5</tr>
    <tr class="class1_a class5_a">6</tr>
    <tr>7</tr>"""

    soup = BeautifulSoup(page)
    rows = soup.find_all('tr', {'class': re.compile('class1.*')})
    for row in rows:
        # In BS4, cls is a list of strings, e.g. ['class1', 'class5']
        cls = row.attrs.get("class")
        if not ("class2" in cls or "class3" in cls):
            print row

=>

    <tr class="class1 class5">3</tr>
    <tr class="class1_a class5_a">4</tr>
    <tr class="class1 class5">5</tr>
    <tr class="class1_a class5_a">6</tr>

In BS4, multi-valued attributes such as class hold a list of strings rather than a single string, which is why the loop tests membership in cls instead of searching for a substring. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id12.
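To make the list-versus-string point concrete, here is a minimal sketch of the same filter written as a standalone predicate over the list form of class. The name keep_row and the sample data are my own, not part of BeautifulSoup. It also assumes you only want to drop rows that carry both class2 and class3, which is slightly stricter than the loop above (that one drops rows with either).

```python
def keep_row(classes):
    """Decide whether a row should be kept, given its class list.

    `classes` is assumed to be what BS4 stores for the class attribute:
    a list of strings, or None when the tag has no class at all.
    """
    if not classes:
        return False
    # Keep rows where some class token begins with "class1" ...
    has_class1 = any(c.startswith("class1") for c in classes)
    # ... unless the row carries both class2 and class3 (assumption).
    excluded = {"class2", "class3"}.issubset(classes)
    return has_class1 and not excluded


# Hypothetical sample data mirroring the rows in the question.
sample = [
    ["class1", "class2", "class3"],
    ["class1", "class5"],
    ["class1_a", "class5_a"],
    None,
]
kept = [c for c in sample if keep_row(c)]
print(kept)
```

Since find_all in BS4 also accepts a function that takes a tag, something like soup.find_all(lambda tag: tag.name == 'tr' and keep_row(tag.get('class'))) should let you skip the unwanted rows during selection instead of filtering afterwards. I have not run that line against your real page.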