I am trying to get the elements in an HTML document that contain text matching the following pattern: #\S{11}
<h2> this is cool
So the element above would be matched with:
soup('h2',text=re.compile(r' #\S{11}'))
And the results will be something like this:
[u'blahblah #223409823523', u'thisisinteresting #293845023984']
I can get all the text that matches (see the line above). But what I want is the parent element of each matching text, so I can use it as a starting point for navigating the document tree. In this case, I would like all the h2 elements to be returned, not the text.
Ideas?
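One direction that might work, sketched here as an assumption rather than a confirmed solution: matched text nodes expose a `.parent` attribute, so the enclosing tags can be collected from the text results. The sketch below assumes the newer `bs4` package (`find_all` with the `string=` argument) rather than the BeautifulSoup version current when this was posted, and the sample markup and variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
html = """
<h2>blahblah #22340982352</h2>
<h2>no marker here</h2>
<p>thisisinteresting #29384502398</p>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r" #\S{11}")

# Searching by string alone returns the matching text nodes.
text_nodes = soup.find_all(string=pattern)

# Each text node's .parent is the tag that directly contains it;
# keep only the parents that are h2 elements.
h2_parents = [node.parent for node in text_nodes if node.parent.name == "h2"]

print(h2_parents)
```

Each item in `h2_parents` is a tag object, so it can serve as the starting point for further navigation (siblings, children, and so on). Note that `.parent` is the immediate container of the text; if the text sits inside a nested tag such as a link within the h2, the immediate parent would be that inner tag instead.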
python regex beautifulsoup html-content-extraction
sotangochips May 14 '09 at 21:46