Scrapy duplicate requests

What is the difference between the duplicate filter in the Scheduler and the IgnoreVisitedItems middleware?

This Google Groups thread suggests that the Scheduler contains a duplicate filter: http://groups.google.com/group/scrapy-users/browse_thread/thread/8e218bcc5b293532

+4
1 answer

The duplicate filter in the scheduler only filters out URLs that have already been seen within a single crawl run, which means it is reset on each subsequent run. The IgnoreVisitedItems middleware keeps state between runs and skips URLs that were visited in the past, but only for the final item URLs, so that the rest of the site can still be crawled to find new items.
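To make the distinction concrete, here is a minimal plain-Python sketch of the two behaviours. It does not use Scrapy and is not how Scrapy implements either component; the class names InRunDupeFilter and PersistentVisitedItems, the crawl_once helper, the JSON file used for storage, and the example URLs are all hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


class InRunDupeFilter:
    """Hypothetical stand-in for a scheduler-level filter: state is in memory only."""

    def __init__(self):
        self.seen = set()

    def should_skip(self, url):
        # Skip any URL already requested during this run.
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


class PersistentVisitedItems:
    """Hypothetical stand-in for a middleware that remembers item URLs across runs."""

    def __init__(self, path):
        self.path = path
        self.visited = set()
        if os.path.exists(path):
            with open(path) as f:
                self.visited = set(json.load(f))

    def should_skip(self, url, is_item_page):
        # Listing pages are never skipped, so new items can still be discovered.
        return is_item_page and url in self.visited

    def mark_visited(self, url):
        self.visited.add(url)

    def save(self):
        with open(self.path, "w") as f:
            json.dump(sorted(self.visited), f)


def crawl_once(requests, store_path):
    """Simulate one run; `requests` is a list of (url, is_item_page) pairs."""
    dupefilter = InRunDupeFilter()                 # fresh on every run
    visited = PersistentVisitedItems(store_path)   # reloaded from disk
    fetched = []
    for url, is_item_page in requests:
        if dupefilter.should_skip(url):
            continue
        if visited.should_skip(url, is_item_page):
            continue
        fetched.append(url)
        if is_item_page:
            visited.mark_visited(url)
    visited.save()
    return fetched


store = os.path.join(tempfile.mkdtemp(), "visited.json")
run1 = [("/list", False), ("/item/1", True), ("/list", False)]
run2 = [("/list", False), ("/item/1", True), ("/item/2", True)]
print(crawl_once(run1, store))  # ['/list', '/item/1']
print(crawl_once(run2, store))  # ['/list', '/item/2']
```

In the first run the repeated /list request is dropped by the in-memory filter. In the second run /list is fetched again because that filter started empty, while /item/1 is skipped because the persisted store still remembers it.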

+12

Source: https://habr.com/ru/post/1396537/

