I am currently writing a web crawler (using the Python Scrapy framework).
I recently had to implement a pause/resume system.
The solution I implemented is of the simplest kind: basically, it stores links when they get scheduled and marks them as "processed" once they actually are. Thus, I can fetch those links (obviously there is a bit more stored than just a URL: a depth value, the domain the link belongs to, etc.) when the spider resumes, and so far everything works well.
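To illustrate the idea, the bookkeeping boils down to something like the sketch below. This is not my real code, just a minimal version wired up as a Scrapy extension through the `request_scheduled` / `response_received` signals, with an in-memory dict standing in for the actual storage (all the names here are made up for the example):

```python
# Rough sketch of the "store when scheduled, mark when processed" bookkeeping,
# wired up as a Scrapy extension via signals. Names are illustrative only.
from urllib.parse import urlparse

from scrapy import signals


class PauseResumeBookkeeping:
    def __init__(self):
        # Stand-in for the real persistence layer (the MySQL table mentioned below):
        # maps url -> {'depth': ..., 'domain': ..., 'processed': ...}
        self.store = {}

    @classmethod
    def from_crawler(cls, crawler):
        # Enabled through the EXTENSIONS setting in settings.py.
        ext = cls()
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        return ext

    def request_scheduled(self, request, spider):
        # Record the link (plus depth and domain) as soon as it gets scheduled.
        self.store.setdefault(request.url, {
            'depth': request.meta.get('depth', 0),
            'domain': urlparse(request.url).netloc,
            'processed': False,
        })

    def response_received(self, response, request, spider):
        # Flag the link once it has actually been fetched.
        if request.url in self.store:
            self.store[request.url]['processed'] = True

    def pending_links(self):
        # Links to feed back to the spider when it resumes.
        return [(url, meta['depth'], meta['domain'])
                for url, meta in self.store.items()
                if not meta['processed']]
```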
Right now I'm just using a MySQL table to handle these storage actions, mainly for rapid prototyping.
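The table itself is nothing fancy; roughly something along these lines (schema and names simplified for illustration, accessed here through the MySQLdb driver):

```python
# Hypothetical MySQL-backed version of the same bookkeeping.
# Table name, columns and connection details are invented for illustration.
import MySQLdb

SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_links (
    url       VARCHAR(2048) NOT NULL,
    depth     INT           NOT NULL,
    domain    VARCHAR(255)  NOT NULL,
    processed TINYINT(1)    NOT NULL DEFAULT 0,
    PRIMARY KEY (url(255))
)
"""


class MySQLLinkStore:
    """MySQL-backed link bookkeeping for pausing/resuming a crawl."""

    def __init__(self, **connect_kwargs):
        self.conn = MySQLdb.connect(**connect_kwargs)
        cur = self.conn.cursor()
        cur.execute(SCHEMA)
        self.conn.commit()

    def mark_scheduled(self, url, depth, domain):
        # Store the link as soon as it gets scheduled.
        cur = self.conn.cursor()
        cur.execute(
            "INSERT IGNORE INTO crawl_links (url, depth, domain) VALUES (%s, %s, %s)",
            (url, depth, domain),
        )
        self.conn.commit()

    def mark_processed(self, url):
        # Flag the link once it has actually been crawled.
        cur = self.conn.cursor()
        cur.execute("UPDATE crawl_links SET processed = 1 WHERE url = %s", (url,))
        self.conn.commit()

    def pending_links(self):
        # Links to feed back to the spider on resume.
        cur = self.conn.cursor()
        cur.execute("SELECT url, depth, domain FROM crawl_links WHERE processed = 0")
        return cur.fetchall()
```

As the sketch shows, every scheduled or processed link ends up as its own INSERT/UPDATE and commit, which is where the write load described below comes from.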
Now I would like to know how I could optimize this, since I believe a database should not be the only option available here. By optimizing, I mean using a very simple and lightweight system, while still being able to handle a large amount of data written in a short time.
Currently, it should be able to handle crawls of a few dozen domains, which means storing several thousand links per second...
Thanks in advance for the suggestions.