PHP regex to match a specific URL pattern

I would like to "capture" several hundred URLs from several hundred HTML pages.

Each link follows this template:

<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>
2 answers

Here's how to do it with PHP's native DOM extension:

// Load the HTML page into a DOM tree
$doc = new DOMDocument;
$doc->loadHTMLFile('http://example.com/');

// Run XPath to fetch the href attribute of every <a> element
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a/@href');

// Collect the href values from the resulting DOMAttr nodes into an array
$urls = array();
foreach ($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);

Please note that the above will also find relative links. If you do not want them, change the XPath to:

'//a/@href[starts-with(., "http")]'

Note that using regex to match HTML is the way to madness. Regex matches string patterns and knows nothing about HTML elements and attributes. DOM does, so you should prefer it over regex for every situation that goes beyond matching a super-trivial string pattern in markup.
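As a rough sketch of how this could be narrowed to the question's template: the snippet below parses an inline string instead of a remote page, and the XPath assumes the wanted links always sit inside an h2 and carry a urlid parameter. The second link (/about.html) is made up purely to show that it gets filtered out.

```php
<?php
// Sample markup: the first link is the template from the question,
// the second is a hypothetical relative link that should be ignored.
$html = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>'
      . '<p><a href="/about.html">About</a></p>';

// Collect parser warnings internally instead of printing them;
// real-world pages are rarely valid markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

// Assumption: the target links are <a> children of <h2> and
// their href contains "urlid=".
$xpath = new DOMXPath($doc);
$links = $xpath->query('//h2/a/@href[contains(., "urlid=")]');

$urls = array();
foreach ($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);
```

For the several hundred pages in the question, the same query could be run in a loop, swapping loadHTML() for loadHTMLFile() on each page.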

'/http:\/\/[^\/]+\/[^.]+\.asp\?urlid=\d+/'
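A pattern like this could be applied with preg_match_all(). The sketch below is one possible usage, not necessarily what the answerer had in mind: it uses ~ as the delimiter so the slashes in the URL need no escaping, and the second link in the sample string is invented to show a non-match.

```php
<?php
// Sample input: the template from the question plus a hypothetical
// unrelated link that should not match.
$html = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>'
      . '<a href="http://example.com/page.html">Other</a>';

// host, one path segment without dots, ".asp", then a numeric urlid
$pattern = '~http://[^/]+/[^.]+\.asp\?urlid=\d+~';

// $matches[0] holds every full match found in the string
preg_match_all($pattern, $html, $matches);
$urls = $matches[0];
print_r($urls);
```

Keep in mind that [^/]+ and [^.]+ will happily run across quotes and line breaks, so on messier pages the pattern may need tightening.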

Or use an HTML parser such as PHP Simple HTML DOM:

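// Load the library (the path is an assumption; adjust it to your setup)
require_once 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');

// Find all links and print their href attributes
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}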

Source: https://habr.com/ru/post/1784058/

