I want to crawl a whole site. I have read several threads, but I can't get the data at the second level.
That is, I can return the links from the start page, but I can't find a way to follow those links and get the content of each linked page.
The code I use is:
<?php
// SELECT STARTING PAGE
$url = 'http://mydomain.com/';
$html = file_get_contents($url);

// GET ALL THE LINKS OF EACH PAGE
// create a dom object
$dom = new DOMDocument();
@$dom->loadHTML($html);

// run xpath for the dom
$xPath = new DOMXPath($dom);

// get links from starting page
$elements = $xPath->query("//a/@href");
foreach ($elements as $e) {
    echo $e->nodeValue . "<br />";
}

// Parse each page using the extracted links?
?>
Can someone help me in the last part with an example?
I'll be very grateful!
Ok, thanks for your answers! I tried a few things, but I haven't managed to get results yet. I am new to programming.
Below are two of my attempts: the first tries to fetch and parse each link, and the second replaces file_get_contents with cURL:
1)
<?php
// GET STARTING PAGE
$url = 'http://www.capoeira.com.gr/';
$html = file_get_contents($url);

// GET ALL THE LINKS FROM STARTING PAGE
// create a dom object
$dom = new DOMDocument();
@$dom->loadHTML($html);

// run xpath for the dom
$xPath = new DOMXPath($dom);

// get specific elements from the sites
$elements = $xPath->query("//a/@href");

// PARSE EACH LINK
foreach ($elements as $e) {
    $URLS = file_get_contents($e);
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xPath = new DOMXPath($dom);
    $output = $xPath->query("//div[@class='content-entry clearfix']");
    echo $output->nodeValue;
}
?>
For the above code, I get: Warning: file_get_contents() expects parameter 1 to be string, object given in ../example.php on line 26
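A note on that warning: the XPath query `//a/@href` yields DOMAttr objects, so `file_get_contents($e)` receives an object rather than a URL string; the string is in `$e->nodeValue`. Likewise, `query()` returns a node list, which has to be iterated rather than read through `->nodeValue`. The following is only a minimal sketch of a two-level crawl under those observations. The function names `extractLinks`, `extractText`, and `crawlOneLevel` are hypothetical, and relative links are resolved in a simplified way (always against the site root), which is an assumption that will not hold for every site.

```php
<?php
// Hypothetical helper: return every http(s) link found in $html as a plain string.
// Relative links are prefixed with $baseUrl (simplified: assumes root-relative paths).
function extractLinks($html, $baseUrl)
{
    if (trim($html) === '') {
        return array(); // nothing to parse
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed real-world HTML
    $xPath = new DOMXPath($dom);

    $links = array();
    foreach ($xPath->query('//a/@href') as $attr) {
        // $attr is a DOMAttr object; nodeValue holds the URL string.
        $href = trim($attr->nodeValue);
        if ($href === '' || $href[0] === '#') {
            continue; // skip empty links and in-page anchors
        }
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href; // already absolute
        } elseif (!preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $href)) {
            // Relative link: prepend the base URL.
            $links[] = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
        }
        // Anything else (mailto:, javascript:, protocol-relative) is ignored.
    }
    return array_values(array_unique($links));
}

// Hypothetical helper: return the trimmed text of every node matching $query.
function extractText($html, $query)
{
    if (trim($html) === '') {
        return array();
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xPath = new DOMXPath($dom);

    $texts = array();
    // query() returns a DOMNodeList, so loop over it.
    foreach ($xPath->query($query) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Hypothetical helper: fetch the start page, then fetch and parse each linked page.
function crawlOneLevel($startUrl, $query)
{
    $results = array();
    $startHtml = @file_get_contents($startUrl);
    if ($startHtml === false) {
        return $results;
    }
    foreach (extractLinks($startHtml, $startUrl) as $link) {
        $pageHtml = @file_get_contents($link);
        if ($pageHtml === false) {
            continue; // skip pages that cannot be fetched
        }
        $results[$link] = extractText($pageHtml, $query);
    }
    return $results;
}
```

With these definitions, a call such as `crawlOneLevel('http://www.capoeira.com.gr/', "//div[@class='content-entry clearfix']")` would return an array keyed by link URL, each entry holding the text of the matching divs on that page. Note that the attempt above also passes `$html` (the start page) to `loadHTML()` inside the loop instead of the freshly fetched page, which this sketch avoids by parsing `$pageHtml`.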
2)
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_POST, 1);
curl_setopt($curl, CURLOPT_URL, "http://capoeira.com.gr");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$content = curl_exec($curl);
curl_close($curl);

$dom = new DOMDocument();
@$dom->loadHTML($content);
$xPath = new DOMXPath($dom);
$elements = $xPath->query("//a/@href");
foreach ($elements as $e) {
    echo $e->nodeValue . "<br />";
}
?>
I am not getting any results. When I echo $content, I get:
You do not have permission to access this server.
Additionally, a 413 Request Entity Too Large error was encountered while trying to use an ErrorDocument to handle the request.
Any ideas please? :)
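One thing that stands out in the second attempt is that `CURLOPT_POST` is enabled but no request body is ever set. A plausible explanation for the error page (an assumption, not verified against this server) is that the server rejects that empty POST. Since reading a page is normally a GET, which is cURL's default, a minimal sketch without `CURLOPT_POST` might look like this. The function name `fetchPage` and the 15-second timeout are hypothetical choices.

```php
<?php
// Hypothetical helper: fetch $url with a plain GET request.
// Returns the response body, or false on a transport error or non-200 status.
function fetchPage($url)
{
    $curl = curl_init($url);
    // CURLOPT_POST is deliberately not set: GET is the default method.
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects, e.g. to the www host
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);          // assumed timeout in seconds
    $content = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);

    return ($content !== false && $status === 200) ? $content : false;
}
```

The result of `fetchPage("http://capoeira.com.gr")` could then be checked against `false` before being passed to `loadHTML()`, so that an error page is not parsed as if it were the site content.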