Crawl a website, extract its links, and crawl each link with PHP and XPath

I want to crawl an entire site. I have read several threads on this, but I can't get the data at the second level.

That is, I can get the links from the start page, but then I can't find a way to follow those links and get the contents of each linked page ...

The code I use is:

    <?php
    // SELECT STARTING PAGE
    $url = 'http://mydomain.com/';
    $html = file_get_contents($url);

    // GET ALL THE LINKS OF EACH PAGE
    // create a dom object
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // run xpath for the dom
    $xPath = new DOMXPath($dom);

    // get links from starting page
    $elements = $xPath->query("//a/@href");
    foreach ($elements as $e) {
        echo $e->nodeValue . "<br />";
    }

    // Parse each page using the extracted links?
    ?>

Can someone help me in the last part with an example?

I'll be very grateful!


OK, thanks for your answers! I tried a few things, but I haven't managed to get any results yet - I am new to programming.

Below are my two attempts: the first one tries to parse the extracted links, and the second one tries to replace file_get_contents with cURL:

1)

    <?php
    // GET STARTING PAGE
    $url = 'http://www.capoeira.com.gr/';
    $html = file_get_contents($url);

    // GET ALL THE LINKS FROM STARTING PAGE
    // create a dom object
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // run xpath for the dom
    $xPath = new DOMXPath($dom);

    // get specific elements from the sites
    $elements = $xPath->query("//a/@href");

    // PARSE EACH LINK
    foreach ($elements as $e) {
        $URLS = file_get_contents($e);
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xPath = new DOMXPath($dom);
        $output = $xPath->query("//div[@class='content-entry clearfix']");
        echo $output->nodeValue;
    }
    ?>

For the above code I get: Warning: file_get_contents() expects parameter 1 to be string, object given in ../example.php on line 26
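The warning is raised because $e is a DOMAttr node rather than a string, and the inner loop also re-parses the start page's $html instead of the fetched page. A minimal sketch of how that loop could look, assuming the extracted hrefs are absolute URLs:

    // PARSE EACH LINK
    foreach ($elements as $e) {
        $pageHtml = file_get_contents($e->nodeValue);   // use the attribute's string value, not the node object
        $pageDom = new DOMDocument();
        @$pageDom->loadHTML($pageHtml);                 // parse the fetched page, not the start page's $html
        $pageXPath = new DOMXPath($pageDom);
        $nodes = $pageXPath->query("//div[@class='content-entry clearfix']");
        foreach ($nodes as $node) {                     // query() returns a DOMNodeList, so iterate it
            echo $node->nodeValue;
        }
    }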

2)

    <?php
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_POST, 1);
    curl_setopt($curl, CURLOPT_URL, "http://capoeira.com.gr");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $content = curl_exec($curl);
    curl_close($curl);

    $dom = new DOMDocument();
    @$dom->loadHTML($content);
    $xPath = new DOMXPath($dom);
    $elements = $xPath->query("//a/@href");
    foreach ($elements as $e) {
        echo $e->nodeValue . "<br />";
    }
    ?>

I am not getting any results. When I echo $content I get:

You do not have permission to access this server.

Additionally, a 413 Request Entity Too Large error was encountered while trying to use an ErrorDocument to handle the request.
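One possible cause, though only a guess, is the CURLOPT_POST line: the request goes out as an empty POST, which some servers answer with 403/413, and some servers also reject requests that carry no User-Agent header. A sketch of the same fetch as a plain GET with a User-Agent set:

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, "http://capoeira.com.gr");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);                      // follow redirects
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible)");  // some servers block requests without a User-Agent
    $content = curl_exec($curl);
    curl_close($curl);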

Any ideas please? :)

+6
5 answers

You can try the following. See this thread for more details.

    <?php
    // set_time_limit(0);
    function crawl_page($url, $depth = 5)
    {
        static $seen = array();                     // remember visited URLs across recursive calls
        if ($depth == 0 or in_array($url, $seen)) {
            return;
        }
        $seen[] = $url;

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $result = curl_exec($ch);
        curl_close($ch);

        if ($result) {
            // keep only the anchor tags and pull out each href and link text
            $stripped_file = strip_tags($result, "<a>");
            preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+" . "(.*?)[\"\']+.*?>" . "([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);

            foreach ($matches as $match) {
                $href = $match[1];
                if (0 !== strpos($href, 'http')) {
                    // relative link: rebuild an absolute URL from the page we are currently on
                    $path = '/' . ltrim($href, '/');
                    if (extension_loaded('http')) {
                        $href = http_build_url($url, array('path' => $path));
                    } else {
                        $parts = parse_url($url);
                        $href = $parts['scheme'] . '://';
                        if (isset($parts['user']) && isset($parts['pass'])) {
                            $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                        }
                        $href .= $parts['host'];
                        if (isset($parts['port'])) {
                            $href .= ':' . $parts['port'];
                        }
                        $href .= $path;
                    }
                }
                crawl_page($href, $depth - 1);
            }
        }
        echo "Crawled {$url}\n";
    }

    crawl_page("http://www.sitename.com/", 3);
    ?>
+3
    $doc = new DOMDocument;
    $doc->loadHTMLFile('file.htm');     // loadHTMLFile for HTML; load() expects well-formed XML
    $items = $doc->getElementsByTagName('a');
    foreach ($items as $value) {
        echo $value->nodeValue . "\n";
        $attrs = $value->attributes;
        echo $attrs->getNamedItem('href')->nodeValue . "\n";
    }
+2

Please check the code below, hope this helps you.

    <?php
    $html = new DOMDocument();
    @$html->loadHTMLFile('http://www.yourdomain.com');
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//div[@class='A-CLASS-Name']/h3/a/@href");
    foreach ($nodelist as $n) {
        echo $n->nodeValue . "\n<br>";
    }
    ?>

Thanks Roger

+1

Find the links of a site recursively, down to a given depth:

    <?php
    $depth = 1;
    print_r(getList($depth));

    function getList($depth)
    {
        $lists = getDepth($depth);
        return $lists;
    }

    function getUrl($request_url)
    {
        $countValid = 0;
        $brokenCount = 0;
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $request_url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // we want to get the response body
        $result = curl_exec($ch);
        curl_close($ch);

        $regex = '|<a.*?href="(.*?)"|';
        preg_match_all($regex, $result, $parts);
        $links = $parts[1];

        $UrlLists = array("clean" => array(), "broken" => array());
        foreach ($links as $link) {
            $url = htmlentities($link);
            if (getFlag($url) == true) {
                $UrlLists["clean"][$countValid] = $url;
                $countValid++;
            } else {
                $UrlLists["broken"][$brokenCount] = "broken->" . $url;
                $brokenCount++;
            }
        }
        return $UrlLists;
    }

    function ZeroDepth($list)
    {
        $result = getUrl($list);
        $lists["0"]["0"]["clean"] = array_unique($result["clean"]);
        $lists["0"]["0"]["broken"] = array_unique($result["broken"]);
        return $lists;
    }

    function getDepth($depth)
    {
        // $list = OW_URL_HOME;
        $list = "https://example.com"; // enter the url of the website
        $lists = ZeroDepth($list);
        for ($i = 1; $i <= $depth; $i++) {
            $l = $i - 1;
            $depthArray = 1;
            foreach ($lists[$l][$l]["clean"] as $depthUrl) {
                $request_url = $depthUrl;
                $lists[$i][$depthArray]["requst_url"] = $request_url;
                $lists[$i][$depthArray] = getUrl($request_url);
            }
        }
        return $lists;
    }

    function getFlag($url)
    {
        $curl = curl_init();
        $curl_options = array();
        $curl_options[CURLOPT_RETURNTRANSFER] = true;
        $curl_options[CURLOPT_URL] = $url;
        $curl_options[CURLOPT_NOBODY] = true;   // we only need the status code, not the body
        $curl_options[CURLOPT_TIMEOUT] = 60;
        curl_setopt_array($curl, $curl_options);
        curl_exec($curl);
        $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        curl_close($curl);                      // close before returning, otherwise this is never reached
        return ($status == 200);
    }
    ?>
+1
    <?php
    $path = 'http://www.hscripts.com/';
    $html = file_get_contents($path);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");

    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
        echo $url . '<br />';
    }
    ?>

You can use the above code to get all the links on a page.
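Note that the hrefs printed this way can still be relative (e.g. /about.html); before fetching them in a second pass they need to be turned into absolute URLs. A rough sketch of that second pass, building on the block above and assuming simple path-only relative links:

    $base = rtrim($path, '/');
    for ($i = 0; $i < $hrefs->length; $i++) {
        $url = $hrefs->item($i)->getAttribute('href');
        if (strpos($url, 'http') !== 0) {
            $url = $base . '/' . ltrim($url, '/');   // naive join: prefix the site root for relative links
        }
        $page = @file_get_contents($url);            // fetch the linked page (second level)
        if ($page !== false) {
            echo "Fetched " . strlen($page) . " bytes from " . $url . "<br />";
        }
    }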

0
