I am working on a multiprocess spider in Python. It should start by scraping one page for links and work outward from there. Specifically, the top-level page contains a list of categories, the second-level pages list the events in those categories, and the third-level pages list the participants in each event. I cannot predict how many categories, events, or participants there will be.
I'm a little lost on how best to structure such a spider, and in particular on how to detect when it has finished crawling (it must keep going until it has found and scraped every relevant page).
Ideally, the first scrape would be synchronous and everything after it asynchronous, to maximize parallel parsing and database insertion, but I'm stuck on how to determine when the crawl is complete.
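For concreteness, here is a rough sketch of the structure I have in mind, built on multiprocessing.JoinableQueue (the URL, depth cutoff, and scrape stub are placeholders, not real code from my project). I'm not certain queue.join() is a robust completion check once workers enqueue new work of their own:

```python
import multiprocessing as mp

def scrape(url, depth):
    """Stub: fetch `url`, parse it, and return any child links found."""
    return []  # e.g. event links at depth 0, participant links at depth 1

def worker(queue):
    while True:
        url, depth = queue.get()
        try:
            if depth < 2:  # categories=0, events=1, participants=2
                for link in scrape(url, depth):
                    queue.put((link, depth + 1))  # new work raises the unfinished count
            else:
                scrape(url, depth)  # leaf page: parse and store, no children
        finally:
            queue.task_done()  # mark this URL as finished

if __name__ == "__main__":
    queue = mp.JoinableQueue()
    queue.put(("https://example.com/categories", 0))  # placeholder top-level page
    workers = [mp.Process(target=worker, args=(queue,), daemon=True)
               for _ in range(4)]
    for p in workers:
        p.start()
    # join() returns once every put() has been matched by a task_done(),
    # i.e. when no URLs remain anywhere in the pipeline
    queue.join()
```

Because each worker enqueues child links before calling task_done() on the parent, the unfinished count should never hit zero while pages are still pending, but I'd welcome corrections if this pattern has holes.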
How would you suggest structuring the spider in terms of parallel processes, and in particular how would you solve the completion-detection problem above?