How does a web crawler achieve maximum coverage?

I have read several articles about web crawling and learned the basics. According to them, a web crawler only follows URLs found on other web pages, traversing a tree (actually, a graph).

In that case, how does a crawler achieve maximum coverage? Obviously, there may be many sites that are not linked to from any other pages/sites. Do search engines support any other mechanisms besides crawling and manual registration (e.g. retrieving information from domain registries)?

If they are purely crawl-based, how do you choose a good set of "root" sites to start crawling from? (We cannot predict the outcome: if we pick 100 sites that nothing links to, only those 100 sites plus their internal pages will ever be found.)

+4
3 answers

Obviously, there may be many sites that are not linked to from any other pages/sites.

I don't think this is as big a problem as you think it is.

Do search engines support any other mechanisms besides crawling and manual registration (e.g. retrieving information from domain registries)?

Not that I have heard of.

If they are based only on crawling, how do you choose a good set of "root" sites to start crawling from?

Any kind of general-purpose web directory, such as the Open Directory Project, would be an ideal candidate, as would social bookmarking sites like Digg or del.icio.us.
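As a rough sketch of what "seeding" a crawl means in practice: the crawler's frontier queue is simply preloaded with URLs taken from such directories before crawling starts. The URLs below are placeholders for illustration, not an actual seed list used by any search engine.

```python
from collections import deque

# Hypothetical seed URLs taken from general-purpose directories and
# social bookmarking sites; a real engine would use a far larger list.
SEED_URLS = [
    "https://dmoz-odp.org/",   # Open Directory Project (example mirror)
    "https://del.icio.us/",    # social bookmarking
    "https://digg.com/",
]

# The crawl frontier initially contains only the seeds; every page
# discovered later must be reachable from one of them through links.
frontier = deque(SEED_URLS)
```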

+3

One mechanism used to help crawlers is the "sitemap". A sitemap is essentially a file that lists the contents of a website so the crawler knows where to go, which is especially useful if the site has dynamic content. The more accurate the sitemap, the better the crawler's coverage of the site.

Here is some information about Google sitemaps:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40318
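For illustration, here is a minimal sketch (Python, standard library only) of how a crawler might read a site's sitemap. The /sitemap.xml path and the tag layout follow the common sitemaps.org XML format; the example.com domain is a placeholder, and error handling is omitted.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def read_sitemap(site):
    """Fetch /sitemap.xml from a site and return the page URLs it lists."""
    with urllib.request.urlopen(f"{site.rstrip('/')}/sitemap.xml") as resp:
        tree = ET.parse(resp)
    # Each <url><loc>...</loc></url> entry names one page the site owner
    # wants crawled, even if nothing else links to it.
    return [loc.text for loc in tree.findall(".//sm:loc", SITEMAP_NS)]

# Hypothetical usage: feed the listed pages into the crawl frontier.
# for url in read_sitemap("https://example.com"):
#     frontier.append(url)
```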

+1

There is no magic mechanism that would let a crawler find a site that is neither linked to from an already-crawled site nor manually submitted to the crawler.

The crawler only traverses the link graph, starting from a set of manually registered (and therefore predefined) roots. Everything outside that graph is unreachable to the crawler; it simply has no way of discovering that content.
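A minimal sketch of that traversal, assuming the third-party requests and BeautifulSoup libraries; a real crawler would also need robots.txt handling, politeness delays, deduplication by content, and much more.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                # assumed third-party HTTP client
from bs4 import BeautifulSoup  # assumed third-party HTML parser

def crawl(roots, max_pages=100):
    """Breadth-first traversal of the link graph, starting from the roots."""
    frontier = deque(roots)    # URLs discovered but not yet fetched
    seen = set(roots)          # every URL ever discovered (avoids revisits)
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        visited += 1
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue           # unreachable page: skip it
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])   # resolve relative links
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    # Anything not reachable by links from the roots is never discovered.
    return seen
```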

+1

Source: https://habr.com/ru/post/1285724/
