Web crawler links / page logic in PHP

I am writing a basic crawler that simply caches pages with PHP.

All it does is use file_get_contents to fetch the contents of the web page and a regular expression to grab all the links <a href="URL">DESCRIPTION</a>. At the moment it returns:

 Array { [url] => URL [desc] => DESCRIPTION } 
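
For concreteness, a minimal sketch of that approach (the function name crawl_links and the exact regex are my own illustration, not the actual code):

 // Illustrative sketch of the described approach; the regex is deliberately simple
 function crawl_links($url) {
     $html = file_get_contents($url);
     preg_match_all('/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/is', $html, $matches, PREG_SET_ORDER);
     $links = array();
     foreach ($matches as $m) {
         $links[] = array('url' => $m[1], 'desc' => $m[2]);
     }
     return $links;
 }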

The problem I am facing is working out the logic to determine whether a link is local or remote, and whether a local link might point to a completely different local directory.

It could be any number of combinations, e.g. href="../folder/folder2/blah/page.html" or href="google.com" or href="page.html" - the possibilities are endless.

What would be the correct algorithm for this? I do not want to lose data that may be important.

3 answers

First of all, regex and HTML don't mix. Use the DOM instead:

 $dom = new DOMDocument();
 @$dom->loadHTML($source); // the @ hides warnings from slightly broken HTML
 foreach ($dom->getElementsByTagName('a') as $a) { $href = $a->getAttribute('href'); }

Links that may go beyond your site begin with a protocol or with //, e.g.

 http://example.com
 //example.com/

href="google.com" is a link to a local file.

But if you want to create a static copy of a site, why not just use wget?


First, consider the properties of local links.

They will either:

  • be relative, with no scheme and no host, or
  • be absolute, with the "http" or "https" scheme and a host that matches the machine the script is running on

That is all the logic you need to determine whether a link is local.

Use the parse_url function to split the URL into its components so you can check the scheme and host.
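
A sketch of that check, assuming you already know the host you are crawling (the function name is_local and the $crawl_host parameter are my own assumptions):

 // Local if the URL is relative (no scheme, no host) or absolute on the same host
 function is_local($href, $crawl_host) {
     $parts = parse_url($href);
     if ($parts === false) {
         return false; // badly malformed URL, skip it
     }
     if (!isset($parts['scheme']) && !isset($parts['host'])) {
         return true;  // relative link: no scheme, no host
     }
     return isset($parts['scheme'], $parts['host'])
         && in_array($parts['scheme'], array('http', 'https'), true)
         && strcasecmp($parts['host'], $crawl_host) === 0;
 }

For example, is_local('../folder/page.html', 'example.com') is true, while is_local('http://google.com/', 'example.com') is false.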


You need to look for http:// in the href. In addition, you can check whether it starts with /, or with some combination of ./ and ../. If you do not find a /, then you have to assume it is just a file name relative to the current page. Do you want a script for this?
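
As a rough sketch only (these prefix checks are cruder than parse_url, and the variable name $href is mine):

 // Crude prefix checks on an href, following the idea above
 if (strpos($href, 'http://') === 0 || strpos($href, 'https://') === 0) {
     // absolute URL, possibly on another site
 } elseif (strpos($href, '/') === 0 || strpos($href, './') === 0 || strpos($href, '../') === 0) {
     // path relative to the site root or the current directory
 } else {
     // bare file name, relative to the current page's directory
 }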


Source: https://habr.com/ru/post/1385839/
