Regexp to extract all links and anchor texts from HTML

I need one or more regular expressions that can:

1) Take the html of a large page.

2) Find the URLs contained in all links, for example:

<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>

And so on, it must extract the URL contained in the attribute 'href', regardless of what happens before or afterhref

3) Extract the anchor text of all links, for example, in the examples above, it should return “http://example1.com” and the anchor text “Test 1”, then “http://example2.com" 'and' Test 2 ', etc.

+3
source share
6 answers
<?

$dom = new DomDocument();
$dom->loadHTML($html);
$urls = $dom->getElementsByTagName('a');
+7
source

.

<?php

$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

if(preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches))
        {
        /*** if we find the word white, not followed by house ***/
        echo 'Found a match';
        print_r($matches);
    }
else
        {
        /*** if no match is found ***/
        echo 'No match found';
        }
?>
+5
<?php
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER))
{ foreach($matches as $match)
{// $match[2] = link address
// $match[3] = link text}
}
?>

, .

+5

- :

//not tested
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
+2
/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis
+1

RegEx HTML, :

\b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))

, "plain" (.. ) HTML:

(<(?<tag>script|style)[\s\S]*?</\k<tag>>)|<!--[\s\S]*?-->|<[\s\S]*?>|(?<text>[^<>]*)

: http://www.martinwardener.com/regex

0

Source: https://habr.com/ru/post/1784057/


All Articles