Designing a Multi-Process Spider in Python

I am working on a multi-process spider in Python. It should start by scraping one page for links and work outward from there. Specifically, the top-level page contains a list of categories, the second-level pages list the events in each category, and the third-level pages list the participants in each event. I can't predict how many categories, events, or participants there will be.

I'm a bit lost on how best to structure such a spider and, in particular, on how to tell when it has finished crawling (it has to keep going until it has found and scraped every relevant page).

Ideally, the first fetch would be synchronous and everything after that asynchronous, to maximize parallel parsing and database inserts, but I'm stuck on how to determine when the crawl is complete.

How would you suggest structuring the spider in terms of parallel processes, and in particular how would you solve the completion-detection problem described above?

2 answers

I assume you are putting pages to visit into a queue, draining the queue with worker processes, and having the workers push any newly discovered pages back onto the queue.

The crawl is finished when all of the workers are idle and the queue of pages to visit is empty.

Python's queues support this directly: call task_done() for every item you finish processing, and join() will block until every item ever put on the queue has been marked done.
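
A minimal sketch of that pattern with multiprocessing.JoinableQueue. Here fetch_links() is a placeholder for the real download-and-parse (and database-insert) step, and the start URL is made up:

```python
import multiprocessing as mp


def fetch_links(url):
    """Placeholder: download `url` and return the links found on it.

    A real spider would use urllib/requests plus an HTML parser here,
    and would also write the scraped data to the database.
    """
    return []


def worker(queue, seen, lock):
    while True:
        url = queue.get()
        if url is None:           # poison pill: time to shut down
            queue.task_done()
            return
        try:
            for link in fetch_links(url):
                with lock:
                    if link in seen:
                        continue
                    seen[link] = True
                queue.put(link)   # newly discovered pages go back on the queue
        finally:
            queue.task_done()     # this page is now fully processed


if __name__ == "__main__":
    start_url = "https://example.com/categories"   # hypothetical top-level page

    queue = mp.JoinableQueue()
    manager = mp.Manager()
    seen = manager.dict()         # shared "already visited" set
    lock = manager.Lock()

    seen[start_url] = True
    queue.put(start_url)

    workers = [mp.Process(target=worker, args=(queue, seen, lock))
               for _ in range(4)]
    for w in workers:
        w.start()

    # join() returns only once task_done() has been called for every item
    # ever put on the queue -- i.e. when the crawl has genuinely finished.
    queue.join()

    for _ in workers:             # one poison pill per worker
        queue.put(None)
    for w in workers:
        w.join()
```

The key point is that a worker puts the links it discovers onto the queue before it calls task_done() on the page it took, so the count of unfinished tasks never drops to zero while there is still work outstanding.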


Take a look at Scrapy: it is an asynchronous scraping framework (built on Twisted) made for exactly this kind of job, and its XPath selectors make pulling the data out easy!

(I used it for something very similar myself, storing the results in sqlite, ... all of the asynchronous work is handled by Twisted under the hood.)
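
To make this concrete, here is a rough sketch of what such a spider could look like. The class name, start URL, and all of the XPath expressions are placeholders for whatever the real site actually uses:

```python
import scrapy


class EventSpider(scrapy.Spider):
    name = "events"
    start_urls = ["https://example.com/categories"]   # hypothetical top-level page

    def parse(self, response):
        # Level 1: the top page lists the categories.
        for href in response.xpath('//a[@class="category"]/@href').getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Level 2: each category page lists its events.
        for href in response.xpath('//a[@class="event"]/@href').getall():
            yield response.follow(href, callback=self.parse_event)

    def parse_event(self, response):
        # Level 3: each event page lists its participants.
        for name in response.xpath('//li[@class="participant"]/text()').getall():
            yield {"event": response.url, "participant": name.strip()}
```

You can run it with something like `scrapy runspider eventspider.py -o participants.json`, or attach an item pipeline that writes each item to sqlite. Note that Scrapy also solves the completion question for you: when there are no more scheduled requests, the engine closes the spider on its own.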


Source: https://habr.com/ru/post/1723685/

