Designing a Multi-Process Spider in Python

I am working on a multi-process spider in Python. It should start by scraping one page for links and work outward from there. Specifically, the top-level page contains a list of categories, the second-level pages list the events in each category, and the third-level pages list the participants in each event. I can't predict how many categories, events, or participants there will be.

I'm a bit lost on how best to structure such a spider and, in particular, on how to tell when it has finished crawling (it has to keep going until it has found and scraped every relevant page).

Ideally, the first fetch would be synchronous and everything after that asynchronous, to maximize parallel parsing and database inserts, but I'm stuck on how to determine when the crawl is complete.

How would you suggest structuring the spider in terms of parallel processes, and in particular how would you solve the completion-detection problem described above?

2 answers

I assume you are putting pages to visit into a queue, draining the queue with worker processes, and having the workers push any newly discovered pages back onto the queue.

The crawl is finished when all of the workers are idle and the queue of pages to visit is empty.

Python's queues support this directly: call task_done() for every item you finish processing, and join() will block until every item ever put on the queue has been marked done.
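
A minimal sketch of that pattern with multiprocessing.JoinableQueue. Here fetch_links() is a placeholder for the real download-and-parse (and database-insert) step, and the start URL is made up:

```python
import multiprocessing as mp


def fetch_links(url):
    """Placeholder: download `url` and return the links found on it.

    A real spider would use urllib/requests plus an HTML parser here,
    and would also write the scraped data to the database.
    """
    return []


def worker(queue, seen, lock):
    while True:
        url = queue.get()
        if url is None:           # poison pill: time to shut down
            queue.task_done()
            return
        try:
            for link in fetch_links(url):
                with lock:
                    if link in seen:
                        continue
                    seen[link] = True
                queue.put(link)   # newly discovered pages go back on the queue
        finally:
            queue.task_done()     # this page is now fully processed


if __name__ == "__main__":
    start_url = "https://example.com/categories"   # hypothetical top-level page

    queue = mp.JoinableQueue()
    manager = mp.Manager()
    seen = manager.dict()         # shared "already visited" set
    lock = manager.Lock()

    seen[start_url] = True
    queue.put(start_url)

    workers = [mp.Process(target=worker, args=(queue, seen, lock))
               for _ in range(4)]
    for w in workers:
        w.start()

    # join() returns only once task_done() has been called for every item
    # ever put on the queue -- i.e. when the crawl has genuinely finished.
    queue.join()

    for _ in workers:             # one poison pill per worker
        queue.put(None)
    for w in workers:
        w.join()
```

The key point is that a worker puts the links it discovers onto the queue before it calls task_done() on the page it took, so the count of unfinished tasks never drops to zero while there is still work outstanding.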


Take a look at Scrapy: it is an asynchronous scraping framework (built on Twisted) made for exactly this kind of job, and its XPath selectors make pulling the data out easy!

(I used it for something very similar myself, storing the results in sqlite, ... all of the asynchronous work is handled by Twisted under the hood.)
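
To make this concrete, here is a rough sketch of what such a spider could look like. The class name, start URL, and all of the XPath expressions are placeholders for whatever the real site actually uses:

```python
import scrapy


class EventSpider(scrapy.Spider):
    name = "events"
    start_urls = ["https://example.com/categories"]   # hypothetical top-level page

    def parse(self, response):
        # Level 1: the top page lists the categories.
        for href in response.xpath('//a[@class="category"]/@href').getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Level 2: each category page lists its events.
        for href in response.xpath('//a[@class="event"]/@href').getall():
            yield response.follow(href, callback=self.parse_event)

    def parse_event(self, response):
        # Level 3: each event page lists its participants.
        for name in response.xpath('//li[@class="participant"]/text()').getall():
            yield {"event": response.url, "participant": name.strip()}
```

You can run it with something like `scrapy runspider eventspider.py -o participants.json`, or attach an item pipeline that writes each item to sqlite. Note that Scrapy also solves the completion question for you: when there are no more scheduled requests, the engine closes the spider on its own.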


Source: https://habr.com/ru/post/1723685/

