I want to crawl useful resources (for example, background images) from certain sites. That part is not hard, especially with great projects like Scrapy.
The problem is that I do not want to crawl the sites just ONCE. I want the crawl to keep running and pick up resources as they are updated. So is there a good strategy for a web crawler to fetch updated pages?
Here is the crude algorithm I have in mind (a rough code sketch follows the list). I divide the crawl into rounds: in each round a URL repository hands the crawler a certain number of URLs (for example, 10,000) to fetch, and then the next round begins. Detailed steps:
1. The crawler adds the start URLs to the URL repository.
2. The crawler requests at most N URLs from the repository to crawl.
3. The crawler fetches the URLs and updates their records in the repository: page content, fetch time, and whether the content changed.
4. Go back to step 2.
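To make the loop concrete, here is a minimal Python sketch of the round-based design above. The names (`UrlRepository`, `fetch_page`, `crawl_forever`, `batch_size`) are placeholders of my own, not Scrapy APIs; a real crawler would need error handling, politeness delays, deduplication, and persistent storage.

```python
import hashlib
import time
from urllib.request import urlopen


class UrlRepository:
    """Stores URLs along with fetch metadata (last fetch time, content hash)."""

    def __init__(self, start_urls):
        self.records = {url: {"last_fetch": None, "content_hash": None}
                        for url in start_urls}

    def next_batch(self, n):
        # Naive selection: hand back the n least recently fetched URLs.
        ordered = sorted(self.records,
                         key=lambda u: self.records[u]["last_fetch"] or 0)
        return ordered[:n]

    def update(self, url, content):
        rec = self.records[url]
        new_hash = hashlib.sha256(content).hexdigest()
        rec["changed"] = new_hash != rec["content_hash"]
        rec["content_hash"] = new_hash
        rec["last_fetch"] = time.time()


def fetch_page(url):
    # Simplistic fetch; real code needs timeouts, retries, and robots.txt checks.
    with urlopen(url, timeout=10) as resp:
        return resp.read()


def crawl_forever(start_urls, batch_size=10_000):
    repo = UrlRepository(start_urls)            # step 1: seed the repository
    while True:
        batch = repo.next_batch(batch_size)     # step 2: take at most N URLs
        for url in batch:
            content = fetch_page(url)
            repo.update(url, content)           # step 3: record fetch time / change
        # step 4: loop back for the next round
```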
To make this work, I still need to answer one question: how do I score how "fresh" a web page is, i.e., estimate the likelihood that a page has been updated, so the crawler knows which URLs to re-crawl first?
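One common heuristic (my own suggestion, not something the question prescribes) is to track whether each page actually changed between fetches and adapt its re-visit interval: shorten the interval when the content hash changed, and back off when it did not. A rough sketch, reusing the record fields from the repository sketch above:

```python
def next_fetch_time(record, min_interval=3600, max_interval=7 * 24 * 3600):
    """Return the earliest time (epoch seconds) we should re-fetch this URL."""
    interval = record.get("interval", min_interval)
    if record.get("changed"):
        # Page changed since the last visit: revisit sooner next time.
        interval = max(min_interval, interval / 2)
    else:
        # Page was unchanged: back off and revisit less often.
        interval = min(max_interval, interval * 2)
    record["interval"] = interval
    return (record["last_fetch"] or 0) + interval
```

The repository's `next_batch` could then prefer URLs whose `next_fetch_time` has already passed, so frequently changing pages are re-crawled sooner than static ones.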
Since this is an open-ended question, I hope it sparks a fruitful discussion here.