Search for anchor text with tags

Question

Search for anchor text with tags

I want to find text between a pair of <a> tags that link to this site

Here's the re line that I use to find the content:

r'''(<a([^<>]*)href=("|')(http://)?(www\.)?%s([^'"]*)("|')([^<>]*)>([^<]*))</a>''' % our_url

The result will be something like this:

r'''(<a([^<>]*)href=("|')(http://)?(www\.)?stackoverflow.com([^'"]*)("|')([^<>]*)>([^<]*))</a>'''

This works great for most links, but these are errors with links to tags inside it. I tried changing the final part of the regex:

([^<]*))</a>'''

in

(.*))</a>'''

But it just got everything on the page after the link, which I don't want. Are there any suggestions as to what I can do to solve this problem?

+3

python regex

Teifion Mar 2 '09 at 17:29

source share

4 answers

>>> import re
>>> pattern = re.compile(r'<a.+href=[\'|\"](.+)[\'|\"].*?>(.+)</a>', re.IGNORECASE)
>>> link = '<a href="http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there">Finding anchor text when there are tags there</a>'
>>> re.match(pattern, link).group(1)
'http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there'
>>> re.match(pattern, link).group(2)
'Finding anchor text when there are tags there'

+3

riza 03 . '09 0:13

- HTML, Beautiful Soup.

+2

Andrew Hare 02 . '09 17:32

Do not greedy search i.e.

(.*?)

+1

Andrew Cox Mar 2 '09 at 17:32

source share

Markusq · Accepted Answer · 2009-03-02T17:37:13+0000

Instead:

[^<>]*

Try:

((?!</a).)*

In other words, match any character that is not the beginning of a sequence </a.

Search for anchor text with tags

More articles: