Selecting special <tr> tags with BeautifulSoup

I am extracting some rows of an HTML table with BeautifulSoup, using this piece of code:

    from bs4 import BeautifulSoup
    import urllib2
    import re

    page = urllib2.urlopen('www.something.bla')
    soup = BeautifulSoup(page)
    rows = soup.findAll('tr', attrs={'class': re.compile('class1.*')})

Here is what I get as a result:

    <tr class="class1 class2 class3">...</tr>
    <tr class="class1 class2 class3">...</tr>
    <tr class="class1 class5">...</tr>
    <tr class="class1_a class5_a">...</tr>
    <tr class="class1 class5">...</tr>
    <tr class="class1_a class5_a">...</tr>
    <!-- etc. -->

However, I would like to exclude (or avoid selecting in the first place) the rows whose class attribute is class1 class2 class3.

How can I do this?
Thanks for the help!

1 answer

Perhaps this is easier without regex. This works with BeautifulSoup 3:

    from BeautifulSoup import BeautifulSoup

    page = """
    <tr class="class1 class2 class3">1</tr>
    <tr class="class1 class2 class3">2</tr>
    <tr class="class1 class5">3</tr>
    <tr class="class1_a class5_a">4</tr>
    <tr class="class1 class5">5</tr>
    <tr class="class1_a class5_a">6</tr>
    <tr>7</tr>"""

    def cond(x):
        if x:
            return x.startswith("class1") and not "class2 class3" in x
        else:
            return False

    soup = BeautifulSoup(page)
    rows = soup.findAll('tr', {'class': cond})
    for row in rows:
        print row

=>

    <tr class="class1 class5">3</tr>
    <tr class="class1_a class5_a">4</tr>
    <tr class="class1 class5">5</tr>
    <tr class="class1_a class5_a">6</tr>

With BeautifulSoup 4, I was able to get it to work as follows:

    import re
    from bs4 import BeautifulSoup

    page = """
    <tr class="class1 class2 class3">1</tr>
    <tr class="class1 class2 class3">2</tr>
    <tr class="class1 class5">3</tr>
    <tr class="class1_a class5_a">4</tr>
    <tr class="class1 class5">5</tr>
    <tr class="class1_a class5_a">6</tr>
    <tr>7</tr>"""

    soup = BeautifulSoup(page)
    rows = soup.find_all('tr', {'class': re.compile('class1.*')})
    for row in rows:
        cls = row.attrs.get("class")
        if not ("class2" in cls or "class3" in cls):
            print row

=>

    <tr class="class1 class5">3</tr>
    <tr class="class1_a class5_a">4</tr>
    <tr class="class1 class5">5</tr>
    <tr class="class1_a class5_a">6</tr>

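In BS4, multi-valued attributes such as class have lists of strings as their values, not strings. That is why the loop above tests list membership ("class2" in cls) instead of searching a single string. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id12 for details.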
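Since class arrives as a list in BS4, the selection and the exclusion can probably be folded into one filter function. Below is a minimal sketch, not a tested drop-in: it assumes Python 3 with bs4 installed, it assumes "starts with class1" should apply to any entry in the class list, and it assumes only rows whose classes are exactly class1, class2, class3 should be dropped. The names EXCLUDED, wanted_classes and select_rows are made up for illustration.

```python
# Hypothetical names; adjust the rule to whatever "special" rows means for you.
EXCLUDED = {"class1", "class2", "class3"}


def wanted_classes(classes):
    """Decide whether a BS4-style class list (or None) should be kept."""
    if not classes:
        # Rows without a class attribute are never selected.
        return False
    # Keep only rows where some class starts with "class1" ...
    has_class1 = any(c.startswith("class1") for c in classes)
    # ... unless the class set is exactly the excluded combination.
    return has_class1 and set(classes) != EXCLUDED


def select_rows(html):
    """Return the <tr> tags in html whose class list passes wanted_classes."""
    # Imported here so the predicate above stays usable without bs4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a function that receives each tag and returns a bool.
    return soup.find_all(
        lambda tag: tag.name == "tr" and wanted_classes(tag.get("class"))
    )
```

With the sample page from the answer, select_rows(page) is intended to keep rows 3 through 6 and skip the rest. Comparing sets ignores the order of the classes, so class3 class2 class1 would be excluded too; if order matters, compare the list itself.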


Source: https://habr.com/ru/post/1396085/

