It is generally much easier and more reliable to use an HTML parser rather than a regular expression.
Using the third-party lxml module:
import lxml.html as LH

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''
doc = LH.fromstring(content)
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)
gives
http://www.somewebsite.com
I find lxml (and XPath) a convenient way to express the elements I'm looking for. However, if installing a third-party module is not an option, you can also do this particular job with the HTMLParser module from the standard library (that is the Python 2 name; the Python 3 equivalent follows below):
import HTMLParser
import contextlib

class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        # Remember the most recent href seen so far.
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

# Feed only the markup up to "sales"; the last href seen is the link we want.
idx = content.find('sales')
with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
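On Python 3 the module was renamed to html.parser; here is the same parser ported to it (a minimal sketch using only the standard library, reusing the class and variable names from above):

import contextlib
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        # Remember the most recent href seen so far.
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

# Same idea as above: parse only up to "sales" and keep the last href seen.
idx = content.find('sales')
with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)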
The XPath expression used in the lxml solution has the following meaning:
//*                              # Find all elements
  [contains(text(),"sales")]     #   whose text content contains "sales"
  /preceding::*                  #   then look at the elements preceding them
  [starts-with(@href,"http")]    #     keeping only those whose href starts with "http"
  [1]                            #     take the nearest such element (preceding:: is a reverse axis)
  /@href                         #     and return its href attribute
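Because preceding:: is a reverse axis, positions are counted backwards from the "sales" element, so [1] picks the closest preceding link rather than the first link in the document. The snippet below illustrates this with a variant of the markup (my own illustrative string, not from the question) that also adds a link after "sales" to show it is never considered:

import lxml.html as LH

# Hypothetical variant: an extra link placed *after* "sales".
variant = '''<html>
<a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a>
<p>sales</p>
<a href="http://www.after-sales.com"></a>
</html>'''

doc = LH.fromstring(variant)
print(doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''))
# ['http://www.somewebsite.com'] -- the nearest preceding link; the link
# after "sales" is outside the preceding:: axis and is ignored.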