Regular expression for finding URLs not inside a hyperlink

There are many regular expressions for matching a URL. However, I am trying to match only URLs that do not appear anywhere in a hyperlink <a> tag (HREF, inner content, etc.). So NONE of the URLs in the following should match:

<a href="http://www.example.com/"> something </a>
<a href="http://www.example.com/"> http://www.example2.com </a>
<a href="http://www.example.com/"> <b> something </b> http://www.example.com/ <span> test </span> </a>

Any URL outside <a></a> must be matched.

One approach I tried was a negative lookahead that checks whether the first anchor tag after the URL is an opening <a> or a closing </a>. If it is a closing </a>, then the URL must be inside a hyperlink. I think the idea is sound, but my negative lookahead did not work (or rather, I wrote the regex incorrectly). Any advice is greatly appreciated.
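The lookahead idea described above can be sketched roughly as follows. This is only an illustration in Python, not the asker's original regex: the URL pattern is a deliberately simple placeholder, and the names `URL`, `INSIDE_ANCHOR` and `urls_via_lookahead` are made up for this example. It also assumes reasonably well-formed markup where every <a> has a matching </a>.

```python
import re

# Hypothetical, deliberately simple URL pattern; swap in a stricter one as needed.
URL = r"https?://[^\s<>\"']+"

# After the URL, skip plain text and any tags other than <a ...> or </a>.
# If the next anchor-related tag is a closing </a>, the URL sits inside a hyperlink.
INSIDE_ANCHOR = r"[^<]*(?:<(?!/?a\b)[^<]*)*</a\s*>"

# Negative lookahead: keep the URL only when it is NOT followed by that shape.
OUTSIDE_URL = re.compile(URL + "(?!" + INSIDE_ANCHOR + ")", re.IGNORECASE)


def urls_via_lookahead(html: str) -> list[str]:
    """Return URLs whose next anchor tag is not a closing </a>."""
    return OUTSIDE_URL.findall(html)
```

Note that a URL in an attribute of some other tag (for example an <img src="...">) still counts as "outside" here, since the check only looks at anchor tags.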

4 answers

You can do this in two steps, instead of trying to create one regex:

  • Remove (replace with nothing) every HTML anchor (the entire anchor: the opening tag, the content, and the closing tag).

  • Match URLs in what remains.

Perl :

my $curLine = $_; # Work on a copy so that $_ is left unchanged.
# Remove every HTML anchor: "<a ...>", "</a>" and everything in between.
# The non-greedy .*? also covers anchors that contain nested tags such as <b>.
$curLine =~ s/<a\b[^>]*>.*?<\/a>//gis;
if ( $curLine =~ /http:\/\// )
{
  print "Matched a URL outside an HTML anchor: $_\n";
}

Alternatively, a single regex can match either an anchor tag or a URL:

# Note that this is a dummy, you'll need a more sophisticated URL regex
regex = '(<a[^>]+>)|(http://.*)'
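One possible way to use such an alternation, sketched in Python: let the first group consume a whole anchor element so that nothing inside it is ever tested against the second group, and keep only the matches where the URL group fired. Extending the first alternative to cover the whole element (not just the opening tag) is an assumption of this sketch, as are the simple URL pattern and the name `urls_via_alternation`.

```python
import re

# Group 1 consumes a whole anchor element; group 2 captures a bare URL.
# The URL part is a hypothetical, deliberately simple pattern.
ANCHOR_OR_URL = re.compile(
    r"""(<a\b[^>]*>.*?</a\s*>)|(https?://[^\s<>"']+)""",
    re.IGNORECASE | re.DOTALL,
)


def urls_via_alternation(html: str) -> list[str]:
    """Return URLs matched by group 2, i.e. those not swallowed by an anchor."""
    # Anchors are matched first and thereby skipped; only group-2 hits are kept.
    return [m.group(2) for m in ANCHOR_OR_URL.finditer(html) if m.group(2)]
```

Because the regex engine scans left to right, an anchor is always consumed as one match before any URL inside it could start a match of its own.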


Peter has a great answer: first, remove the anchors so that

Some text <a href="http://page.net">TeXt</a> and some more text with link http://a.net

is replaced by

Some text  and some more text with link http://a.net

Then run a regex that finds the URLs, which yields:

http://a.net
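The same two steps can be sketched in Python on the example sentence above. The URL pattern is a simple placeholder, and the names `ANCHOR`, `URL` and `urls_after_stripping` are hypothetical.

```python
import re

# Step 1 pattern: a whole anchor element, including any nested tags.
ANCHOR = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)
# Step 2 pattern: a hypothetical, deliberately simple URL regex.
URL = re.compile(r"https?://[^\s<>\"']+")


def urls_after_stripping(html: str) -> list[str]:
    """Remove all anchors, then collect the URLs that are left."""
    stripped = ANCHOR.sub("", html)
    return URL.findall(stripped)


text = ('Some text <a href="http://page.net">TeXt</a> '
        'and some more text with link http://a.net')
print(ANCHOR.sub("", text))        # the sentence with the anchor removed
print(urls_after_stripping(text))  # ['http://a.net']
```

Splitting the work this way keeps both patterns short, at the cost of losing the original character offsets of the URLs.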

Use the DOM to filter out the anchor elements; what remains can then be searched with a simple URL regex.
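A minimal sketch of this parser-based idea, using Python's standard `html.parser` module. It is an event-based HTML parser, not a full DOM, so this is an approximation of the suggestion. The class name `OutsideAnchorText`, the function `urls_via_parser` and the simple URL pattern are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately simple URL pattern.
URL = re.compile(r"https?://[^\s<>\"']+")


class OutsideAnchorText(HTMLParser):
    """Collects text nodes that are not inside any <a> element."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while we are inside an <a> element
        self.chunks = []       # text seen outside anchors

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        if self.anchor_depth == 0:
            self.chunks.append(data)


def urls_via_parser(html: str) -> list[str]:
    """Return URLs found in text outside anchor elements."""
    parser = OutsideAnchorText()
    parser.feed(html)
    parser.close()
    # Join with a space so text from neighbouring nodes cannot fuse into one URL.
    return URL.findall(" ".join(parser.chunks))
```

Since only text nodes are collected, URLs inside attributes of any tag are ignored as well, which differs slightly from the regex-only approaches.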


Source: https://habr.com/ru/post/1715761/