Crawling for eternity

I recently built a web application for recurring events. Events can repeat daily, weekly, or monthly.

This all works great. But when I started building a page for viewing events (which will be visible on the public Internet), a question occurred to me.

If a crawler lands on this page and follows the next and previous buttons through the dates, will it keep going forever? Because of that I gave up on plain HTML links and used AJAX instead, which means bots cannot follow the links.

But this approach means I lose that functionality for users without JavaScript. Or is the number of users without JavaScript too small to worry about?
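For context, here is a minimal sketch of the AJAX-only navigation I mean (the element ids, the fragment URL, and the month parameter are invented for illustration): the prev/next controls are plain buttons with no href, so a crawler has nothing to follow, but without JavaScript they do nothing at all.

```typescript
// Minimal sketch of AJAX-only calendar navigation: no hrefs for bots to
// follow, but also no fallback for users without JavaScript.
// The element ids and the /events/fragment URL are hypothetical.

let current = new Date(); // month currently shown

async function loadMonth(date: Date): Promise<void> {
  const ym = `${date.getFullYear()}-${String(date.getMonth() + 1).padStart(2, "0")}`;
  const response = await fetch(`/events/fragment?month=${ym}`);
  document.querySelector("#calendar")!.innerHTML = await response.text();
}

function shiftMonth(delta: number): void {
  current = new Date(current.getFullYear(), current.getMonth() + delta, 1);
  void loadMonth(current);
}

document.querySelector("#next")!.addEventListener("click", () => shiftMonth(+1));
document.querySelector("#prev")!.addEventListener("click", () => shiftMonth(-1));
```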

Is there a better way to handle this?

I'm also very interested in how crawlers like Googlebot detect black holes like these, and what they do to deal with them.

+4
2 answers

Add a nofollow hint to the page or to the individual links you do not want crawled: rel="nofollow" on the links, a robots meta tag on the page, or a Disallow rule in robots.txt. See the Robots Exclusion Standard.
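For example (the URLs and the exact markup here are only illustrative, not taken from your app):

```html
<!-- Per-link: compliant crawlers will not follow these -->
<a href="/events?month=2015-06" rel="nofollow">&laquo; previous month</a>
<a href="/events?month=2015-08" rel="nofollow">next month &raquo;</a>

<!-- Per-page: index this page, but follow none of its links -->
<meta name="robots" content="index, nofollow">
```

If you would rather keep crawlers out of the whole date range, a robots.txt rule along the lines of `Disallow: /events?month=` (a prefix match against those URLs) does it for compliant bots.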

You still need to think about how to fend off ill-behaved bots that ignore the standard.

+4

Even a minimally functional web crawler is far more complex than you might imagine, and the situation you describe is not a problem. Crawlers work on some variant of breadth-first search, so even if they do nothing to detect black holes, it doesn't matter much. Another typical feature of web crawlers that helps is that they avoid fetching a large number of pages from a single domain in a short period of time, because otherwise they would inadvertently mount a DoS attack against any site with less bandwidth than the crawler has.
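As a toy illustration (my own sketch, not how any real crawler is written), a breadth-first frontier with a per-host politeness delay and a depth cap already makes an endless chain of next-month pages harmless: each extra month sits one level deeper in the queue, behind everything else that is waiting.

```typescript
// Toy breadth-first crawl frontier with per-host politeness and a depth cap.
// All numbers and the fetchAndExtractLinks callback are illustrative.

interface QueueItem { url: string; depth: number; }

const MAX_DEPTH = 10;            // a calendar "black hole" stops expanding here
const PER_HOST_DELAY_MS = 5_000; // never hammer a single site

const frontier: QueueItem[] = [{ url: "https://example.com/", depth: 0 }];
const seen = new Set<string>(frontier.map((item) => item.url));
const nextAllowedFetch = new Map<string, number>(); // host -> earliest time

async function crawl(fetchAndExtractLinks: (url: string) => Promise<string[]>) {
  while (frontier.length > 0) {
    const item = frontier.shift()!;            // FIFO queue = breadth-first
    const host = new URL(item.url).host;

    // Politeness: wait until this host may be fetched again.
    const wait = (nextAllowedFetch.get(host) ?? 0) - Date.now();
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
    nextAllowedFetch.set(host, Date.now() + PER_HOST_DELAY_MS);

    const links = await fetchAndExtractLinks(item.url);
    if (item.depth + 1 > MAX_DEPTH) continue;  // don't descend forever
    for (const link of links) {
      if (!seen.has(link)) {
        seen.add(link);
        frontier.push({ url: link, depth: item.depth + 1 });
      }
    }
  }
}
```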

Although a crawler does not necessarily detect black holes, a good one can have all kinds of heuristics to avoid wasting time on low-value pages. For example, it may choose to ignore pages that do not contain a minimum amount of English (or other-language) text, pages that contain nothing but links, pages that appear to contain binary data, and so on. The heuristics do not have to be perfect, because the basic breadth-first ordering already guarantees that no single site can eat up too much crawling time, and the sheer size of the web means that even if some "good" pages are skipped, there are always plenty of other good pages to find. (Of course, that is from the crawler's point of view; if you own the skipped pages it may be a problem for you, but companies like Google that run web crawlers deliberately keep the details of such heuristics secret, because they do not want people trying to game them.)
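A crude version of that kind of filter might look like the sketch below; every threshold in it is invented for illustration, since real crawlers keep theirs secret.

```typescript
// Crude "is this page worth keeping?" filter; every threshold is invented.
function looksLowValue(body: string): boolean {
  // Likely binary data: a noticeable share of non-printable characters.
  const nonPrintable =
    (body.match(/[^\x09\x0A\x0D\x20-\x7E\u00A0-\uFFFF]/g) ?? []).length;
  if (nonPrintable / Math.max(body.length, 1) > 0.05) return true;

  // Strip scripts and tags, then count the words that remain.
  const text = body
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, " ");
  const words = text.split(/\s+/).filter((word) => word.length > 1);
  if (words.length < 50) return true; // not enough real text

  // Pages that are essentially nothing but links.
  const linkCount = (body.match(/<a\s/gi) ?? []).length;
  if (linkCount > 0 && words.length / linkCount < 5) return true;

  return false;
}
```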

+2

Source: https://habr.com/ru/post/1445273/

