Even a minimally functional web crawler requires a lot more engineering than you might imagine, and the situation you describe isn't really a problem in practice. Crawlers work breadth-first, so even if they do nothing special to detect "black holes" (spider traps), it doesn't matter much. Another standard crawler feature that helps is politeness: a crawler avoids fetching a large number of pages from one domain in a short period of time, because otherwise it would inadvertently mount a DoS attack against any site with less bandwidth than the crawler itself.
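Roughly, the scheduling logic looks something like the sketch below (Python, standard library only; `fetch` and `extract_links` are hypothetical callbacks standing in for the real HTTP and HTML-parsing code, and the delay/budget constants are made up for illustration):

```python
import time
from collections import deque
from urllib.parse import urlparse

POLITENESS_DELAY = 10.0   # seconds between requests to one domain (illustrative value)
MAX_PAGES = 1000          # overall crawl budget (illustrative value)

def crawl(seed_urls, fetch, extract_links):
    """Breadth-first crawl with a per-domain politeness delay.

    `fetch(url)` and `extract_links(html)` are assumed to be supplied by the
    caller; they are placeholders, not part of any real crawler's API.
    """
    frontier = deque(seed_urls)   # FIFO queue -> breadth-first order
    seen = set(seed_urls)
    last_hit = {}                 # domain -> time of last request to it

    pages = 0
    while frontier and pages < MAX_PAGES:
        url = frontier.popleft()
        domain = urlparse(url).netloc

        # Politeness: if we fetched from this domain too recently, requeue the
        # URL and move on to another domain instead of hammering this one.
        last = last_hit.get(domain)
        if last is not None and time.monotonic() - last < POLITENESS_DELAY:
            frontier.append(url)
            time.sleep(0.05)      # avoid busy-waiting if one domain dominates the frontier
            continue

        last_hit[domain] = time.monotonic()
        html = fetch(url)
        pages += 1

        # Breadth-first expansion: children go to the back of the queue, so a
        # spider trap on one site cannot monopolise the crawl.
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
```

Because the frontier is a plain FIFO queue, an infinitely deep trap on one site just keeps adding URLs to the back of the queue while everything else on the web still gets its turn, and the politeness delay caps how fast any single site can be hit regardless.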
Although a crawler doesn't necessarily detect black holes, a good one can use all kinds of heuristics to avoid wasting time on low-value pages. For example, it may choose to ignore pages that don't contain a minimum amount of English text (or text in whatever language it targets), pages that consist only of links, pages that appear to contain binary data, and so on. The heuristics don't have to be perfect, because the basic breadth-first ordering already ensures that no single site can consume too much crawl time, and the sheer size of the web means that even if the crawler skips some "good" pages, there are always plenty of other good pages to find. (Of course, that's from the crawler's point of view; if you own the skipped pages it may be a problem for you, but companies like Google that run web crawlers deliberately keep the exact details of such heuristics secret, because they don't want people trying to game them.)
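A crude version of such a filter might look like the following (the thresholds and the regex-based text extraction are purely illustrative assumptions; a real crawler would use a proper HTML parser and language detection):

```python
import re

MIN_TEXT_CHARS = 500      # illustrative threshold, not from any real crawler
MAX_LINK_DENSITY = 0.5    # fraction of visible text that sits inside <a> tags

def looks_low_value(raw_bytes, html_text):
    """Cheap heuristics for skipping pages that are unlikely to be worth indexing.

    These checks are deliberately crude: a false positive just means one page
    gets skipped, which matters little given how many other pages remain.
    """
    # Likely binary content: NUL bytes near the start of the response.
    if b"\x00" in raw_bytes[:4096]:
        return True

    # Strip tags to approximate the visible text.
    text = re.sub(r"<[^>]+>", " ", html_text)
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) < MIN_TEXT_CHARS:
        return True

    # Pages that are essentially nothing but links (typical of link farms and spider traps).
    anchor_text = " ".join(re.findall(r"<a\b[^>]*>(.*?)</a>", html_text, re.I | re.S))
    anchor_text = re.sub(r"<[^>]+>", " ", anchor_text)
    if len(anchor_text) / len(text) > MAX_LINK_DENSITY:
        return True

    return False
```

The point is not that any particular rule is right, only that a stack of cheap, imperfect filters is enough when the cost of a mistake is merely skipping one page out of billions.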