Using BeautifulSoup to search for an HTML tag that contains specific text

Question


I am trying to find the elements in an HTML document that contain text matching the following pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

The element above would be matched with:

 soup('h2',text=re.compile(r' #\S{11}'))

And the results look something like this:

 [u'blahblah #223409823523', u'thisisinteresting #293845023984']

I can get all the matching text (see the line above), but I want the parent element of each match so that I can use it as a starting point for traversing the document tree. In this case, I would like the h2 elements to be returned, not the text.

Ideas?

+48

python regex beautifulsoup html-content-extraction

sotangochips May 14 '09 at 9:46 p.m.


2 answers

BeautifulSoup search operations return a list of BeautifulSoup.NavigableString objects when text= is used as a criterion, rather than the BeautifulSoup.Tag objects returned in other cases. Inspect an object's __dict__ to see the attributes available to you. Of those attributes, parent is preferred over previous because of changes in BS4.

    from BeautifulSoup import BeautifulSoup
    from pprint import pprint
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    # Even though the OP was not looking for 'cool', it is more
    # understandable to work with item zero.
    pattern = re.compile(r'cool')

    pprint(soup.find(text=pattern).__dict__)
    #>> {'next': u'\n',
    #>>  'nextSibling': None,
    #>>  'parent': <h2>this is cool #12345678901</h2>,
    #>>  'previous': <h2>this is cool #12345678901</h2>,
    #>>  'previousSibling': None}

    print soup.find('h2')
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern)
    #>> this is cool #12345678901
    print soup.find('h2', text=pattern).parent
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern) == soup.find('h2')
    #>> False
    print soup.find('h2', text=pattern) == soup.find('h2').text
    #>> True
    print soup.find('h2', text=pattern).parent == soup.find('h2')
    #>> True
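The session above is Python 2 with BeautifulSoup 3. For readers on Python 3, the same checks can be sketched against Beautiful Soup 4. This is only a sketch: it assumes the `bs4` package is installed, it uses the built-in `html.parser` backend, and it uses `string=`, the newer spelling of `text=` in bs4. One difference worth noting is that in bs4, combining a tag name with `string=` returns the tag itself rather than the text node.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"cool")

# Searching by string alone yields the matching text node.
text_node = soup.find(string=pattern)

# The text node's .parent is the enclosing <h2> tag.
parent_tag = text_node.parent

# With a tag name plus string=, bs4 returns the tag whose .string matches.
direct_tag = soup.find("h2", string=pattern)

print(type(text_node).__name__, type(parent_tag).__name__)
print(direct_tag == parent_tag)
```

Either route ends at the same `h2` tag, so `.parent` is only needed when the search is made by string alone.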

+11

Bruno Bronosky Nov 12 '12 at 18:05


nosklo · Accepted Answer · 2009-05-14 21:53

    from BeautifulSoup import BeautifulSoup
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    for elem in soup(text=re.compile(r' #\S{11}')):
        # The pattern also matches the <h1> text, so keep only <h2> parents.
        if elem.parent.name == 'h2':
            print elem.parent

Prints:

    <h2>this is cool #12345678901</h2>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
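A rough Python 3 port of this approach, assuming the `bs4` package (Beautiful Soup 4) is installed and using the built-in `html.parser` backend, might look like the following. The names `via_parent` and `direct` are illustrative only. Because the text inside the `h1` in the sample also matches the pattern, the first option filters on the parent's tag name; the second lets bs4 apply the tag name and the `string=` filter together.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Option 1: find the matching text nodes, then step up to each parent,
# keeping only parents that are <h2> tags.
via_parent = [
    node.parent
    for node in soup.find_all(string=pattern)
    if node.parent.name == "h2"
]

# Option 2: ask for <h2> tags whose .string matches the pattern.
direct = soup.find_all("h2", string=pattern)

for tag in direct:
    print(tag)
```

Both options return the tags themselves, so each result can serve as a starting point for navigating the tree (for example via `tag.parent` or `tag.find_next_sibling()`).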
