Web Crawler Update Strategy

I want to crawl useful resources (for example, background images) from certain sites. This is not hard work, especially with great projects like scrapy.

The problem is that I do not want to go over these sites just ONCE. I want the crawl to keep running and pick up updated resources. So I want to know: is there a good strategy for a web crawler to fetch updated pages?

Here is the crude algorithm I was thinking about. I divide the crawl process into rounds; in each round, a URL repository gives the crawler a certain number of URLs (for example, 10,000) to crawl, and then the next round begins. Detailed steps (a rough sketch follows the list):

  • The crawler adds the start URLs to the URL repository.
  • The crawler requests at most N URLs from the repository to crawl.
  • The crawler fetches those URLs and updates their records in the repository: page content, fetch time, and whether the content changed.
  • Go back to step 2.
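
A rough sketch of that round loop in Python, just to make the steps concrete. The `UrlRepository` class, its fields, and the `fetch` callable are placeholders I am assuming for illustration, not a finished design:

```python
import hashlib
import time


class UrlRepository:
    """Hypothetical URL repository keeping per-URL crawl metadata."""

    def __init__(self, start_urls):
        # Step 1: seed the repository with the start URLs.
        self.records = {
            url: {"fetch_time": None, "content_hash": None, "changed": False}
            for url in start_urls
        }

    def next_batch(self, n):
        # Step 2: hand out at most n URLs, naively preferring the least
        # recently fetched ones.
        urls = sorted(self.records, key=lambda u: self.records[u]["fetch_time"] or 0)
        return urls[:n]

    def update(self, url, content):
        # Step 3: record fetch time and whether the content changed.
        rec = self.records[url]
        new_hash = hashlib.sha256(content).hexdigest()
        rec["changed"] = rec["content_hash"] is not None and rec["content_hash"] != new_hash
        rec["content_hash"] = new_hash
        rec["fetch_time"] = time.time()


def crawl_rounds(repo, fetch, batch_size=10_000):
    while True:
        batch = repo.next_batch(batch_size)
        for url in batch:
            repo.update(url, fetch(url))  # fetch() returns the page body as bytes
        # Step 4: loop back and start the next round.
```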

To make this work, I still need to answer the following question: how do I decide that a page is "updated", i.e., how do I estimate the likelihood that a given page has changed since the last fetch?

Since this is an open question, I hope it leads to a fruitful discussion here.

1 answer

The "batch" algorithm that you describe is a common way to implement this; I have worked on several such implementations with scrapy .

The approach I took is to initialize the spider's start URLs with the next batch to crawl, and have it output data (resources + links) as usual. You then process that output when you want to build the next batch. You can parallelize all of this so that many spiders crawl different batches at the same time; if you put URLs from the same site into the same batch, scrapy will take care of politeness (with some configuration for your preferences).
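
A minimal sketch of what such a batch-fed spider could look like. The batch file, its one-URL-per-line format, and the item fields are assumptions for illustration, not part of the answer; a pipeline would write the yielded items back into the URL repository:

```python
import scrapy


class BatchSpider(scrapy.Spider):
    name = "batch_spider"

    def __init__(self, batch_file="batch.txt", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The URL repository writes one batch of URLs per file (hypothetical format).
        with open(batch_file) as f:
            self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # Emit the fetched resource; a pipeline can store it in the repository
        # together with the fetch time and a content hash.
        yield {
            "url": response.url,
            "body_length": len(response.body),
        }
        # Emit discovered links so the repository can schedule them in a
        # future batch instead of following them immediately.
        for href in response.css("a::attr(href)").getall():
            yield {"discovered_url": response.urljoin(href)}
```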

An interesting refinement is to split the scheduling into short-term (within the same batch, handled inside scrapy) and long-term (between batches). This gives you some of the advantages of a more incremental approach while keeping things a little simpler.

There are many approaches to the problem of ordering the crawl (the "which pages have been updated" question you raised), and the best one depends on your priorities (freshness vs. completeness, whether some resources matter more than others, etc.); one simple heuristic is sketched below.
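
For illustration only (this is my own assumption, not something prescribed above): one common heuristic is to estimate each page's change rate from its fetch history and order the next batch by a score combining staleness and importance. The record fields used here are hypothetical:

```python
import time


def change_rate(record):
    """Estimated changes per day, from the page's observed history.

    `record` is assumed to carry an `observed_changes` counter and a
    `first_seen` timestamp (epoch seconds); both fields are hypothetical.
    """
    days_tracked = max((time.time() - record["first_seen"]) / 86400, 1.0)
    # Add-one smoothing so pages with no observed changes still get revisited.
    return (record["observed_changes"] + 1) / days_tracked


def crawl_priority(record, importance=1.0):
    """Higher score = fetch sooner: staleness weighted by change rate and importance."""
    days_since_fetch = (time.time() - record["fetch_time"]) / 86400
    return importance * change_rate(record) * days_since_fetch

# The next batch would then be the top-N URLs ranked by crawl_priority.
```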

I would recommend the survey "Web Crawling" by Christopher Olston and Marc Najork. It is an excellent overview and covers the topics you are interested in (the batch crawl model and crawl ordering).
