Regex for getting matches within a group
I do not know if the following is possible. Suppose I have the following text:
<ul class="yes">
<li><img src="whatever1"></li>
<li><img src="whatever2"></li>
<li><img src="whatever3"></li>
<li><img src="whatever4"></li>
</ul>
<ul class="no">
<li><img src="whatever5"></li>
<li><img src="whatever6"></li>
<li><img src="whatever7"></li>
<li><img src="whatever8"></li>
</ul>
I would like to match each img src inside the ul with the class yes. I want one regex that returns:
whatever1
whatever2
whatever3
whatever4
How can I combine two regular expressions like these into one regular expression?
<ul class="yes">(.+?)<\/ul>
<img src="(whatever.+?)">
1 answer
It is well known that regex is a poor fit for parsing XML-like markup. Better to drop that idea and use a proper HTML parser, for example BeautifulSoup4:
import bs4
html = """
<ul class="yes">
<li><img src="whatever1"></li>
<li><img src="whatever2"></li>
<li><img src="whatever3"></li>
<li><img src="whatever4"></li>
</ul>
<ul class="no">
<li><img src="whatever5"></li>
<li><img src="whatever6"></li>
<li><img src="whatever7"></li>
<li><img src="whatever8"></li>
</ul>
"""
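# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bs4.BeautifulSoup(html, "html.parser")
def match_imgs(tag):
    # Keep only <img> tags whose grandparent is a <ul> with class "yes".
    return tag.name == 'img' \
        and tag.parent.parent.name == 'ul' \
        and tag.parent.parent['class'] == ['yes']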
imgs = soup.find_all(match_imgs)
print(imgs)
whatevers = [i['src'] for i in imgs]
print(whatevers)
Output:
[<img src="whatever1"/>, <img src="whatever2"/>, <img src="whatever3"/>,
<img src="whatever4"/>]
[u'whatever1', u'whatever2', u'whatever3', u'whatever4']
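If a regex is still wanted for a small snippet whose layout is known in advance, one way to "combine" the two patterns from the question is to apply them in sequence rather than merge them: first capture each ul block with the class yes, then pull the src values out of each captured block. The following is only a minimal sketch using the standard re module. The helper name yes_sources is made up for illustration, and the approach assumes the markup is written exactly as in the question (same attribute order, double quotes, no nested ul).

```python
import re

html = """
<ul class="yes">
<li><img src="whatever1"></li>
<li><img src="whatever2"></li>
<li><img src="whatever3"></li>
<li><img src="whatever4"></li>
</ul>
<ul class="no">
<li><img src="whatever5"></li>
<li><img src="whatever6"></li>
<li><img src="whatever7"></li>
<li><img src="whatever8"></li>
</ul>
"""

# Step 1: capture the body of every <ul class="yes"> block.
# re.DOTALL lets "." match newlines, so the block can span several lines.
UL_YES = re.compile(r'<ul class="yes">(.+?)</ul>', re.DOTALL)

# Step 2: capture the src value of each <img> inside a captured block.
IMG_SRC = re.compile(r'<img src="(whatever.+?)">')


def yes_sources(text):
    """Hypothetical helper: src values of <img> tags inside <ul class="yes">."""
    return [
        src
        for block in UL_YES.findall(text)
        for src in IMG_SRC.findall(block)
    ]


print(yes_sources(html))
```

This breaks as soon as the markup varies (extra attributes, single quotes, nested lists), which is the reason a parser is the safer choice.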