What is the most efficient way to store crawler state?

I am currently writing a web crawler using the Python framework Scrapy.
I recently had to implement a pause/resume system.
The solution I implemented is the simplest one possible: it stores links when they get scheduled and marks them as "processed" once they actually have been crawled. That way I can fetch those links (obviously, a bit more than just the URL is stored: a depth value, the domain the link belongs to, etc.) when the spider resumes, and so far everything works fine.

Right now I am just using a MySQL table to handle these storage operations, mostly for quick prototyping.
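To give an idea, the table and the two operations look roughly like this (a sketch only; the column names are illustrative, and sqlite3 stands in for MySQL here just so the snippet is self-contained):

    import sqlite3

    # Illustrative sketch: the real prototype uses a MySQL table, but the
    # operations are the same. Column names are made up for the example.
    conn = sqlite3.connect("crawl_state.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "  url TEXT PRIMARY KEY,"
        "  depth INTEGER,"
        "  domain TEXT,"
        "  processed INTEGER DEFAULT 0)"
    )

    def schedule(url, depth, domain):
        # Store the link when it gets scheduled.
        conn.execute(
            "INSERT OR IGNORE INTO links (url, depth, domain) VALUES (?, ?, ?)",
            (url, depth, domain),
        )
        conn.commit()

    def mark_processed(url):
        # Mark the link once it has actually been crawled.
        conn.execute("UPDATE links SET processed = 1 WHERE url = ?", (url,))
        conn.commit()

    def pending():
        # On resume, fetch everything that was scheduled but never processed.
        return conn.execute(
            "SELECT url, depth, domain FROM links WHERE processed = 0"
        ).fetchall()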

Now I would like to know how I could optimize this, since I believe a database should not be the only option here. By optimize, I mean using a very simple and lightweight system that can still handle a large amount of data written in a short time.

For now, it should be able to handle crawls over a few dozen domains, which means storing a few thousand links per second ...

Thanks in advance for the suggestions.

+3
2 answers

Usually the fastest way to persist things is simply to append them to a log file: such a purely sequential access pattern minimizes disk seeks, which are normally the biggest part of the cost of storage. On restart, you re-read the log and rebuild in memory the state you were maintaining while appending to it.

Your case may be helped further by the fact that you probably do not need 100% reliability: if losing a handful of entries after a crash is acceptable, the log file does not even have to be fsync'ed on every write.
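A minimal sketch of that approach, assuming one JSON line per event and that losing the last few entries on a crash is acceptable (the fsync is optional and off by default):

    import json
    import os

    class CrawlLog:
        """Append-only log of crawl events; state is rebuilt by replaying it."""

        def __init__(self, path, durable=False):
            self.f = open(path, "a")
            self.durable = durable  # True only if every write must survive a crash

        def record(self, event, url, **extra):
            # event is e.g. "scheduled" or "processed"; extra holds depth, domain, ...
            self.f.write(json.dumps({"event": event, "url": url, **extra}) + "\n")
            self.f.flush()
            if self.durable:
                os.fsync(self.f.fileno())  # costly; only for 100% reliability

        @staticmethod
        def replay(path):
            # Rebuild the pending set on restart: scheduled minus processed.
            pending = {}
            if not os.path.exists(path):
                return pending
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["event"] == "scheduled":
                        pending[rec["url"]] = rec
                    elif rec["event"] == "processed":
                        pending.pop(rec["url"], None)
            return pending

On resume, CrawlLog.replay(path) gives back exactly the links that were scheduled but never processed.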

Since what you store is structured (per URL there is a depth value, the domain, a processed flag, and so on) and you need to look entries up by key, a hand-rolled log is not the only option: an embedded key-value store (a dbm flavour such as Berkeley DB, for example) gives you that lookup while staying much lighter than a client/server database; which of the two wins depends on the details of your access pattern.
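A sketch of the key-value variant, using the standard-library dbm module as a stand-in for Berkeley DB and treating the URL as the key:

    import dbm
    import json

    # Sketch only: dbm stands in for Berkeley DB; keys are URLs, values are
    # the small structured records described above.
    db = dbm.open("crawl_state", "c")

    def schedule(url, depth, domain):
        db[url] = json.dumps({"depth": depth, "domain": domain, "processed": False})

    def mark_processed(url):
        rec = json.loads(db[url])
        rec["processed"] = True
        db[url] = json.dumps(rec)

    def pending():
        # Iterate over everything scheduled but not yet processed.
        for url in db.keys():
            rec = json.loads(db[url])
            if not rec["processed"]:
                yield url.decode(), rec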

+3

There was a talk at PyCon 2009 about precise state recovery and restart for data-analysis applications that you may find interesting.

Another quick way to save your state is to serialize it to disk with pickle.
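A minimal sketch, assuming the state object is picklable and fits in memory:

    import pickle

    # Dump the pending links (or any picklable state) to disk on pause,
    # and load them back on resume.
    def save_state(state, path="crawler_state.pkl"):
        with open(path, "wb") as f:
            pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load_state(path="crawler_state.pkl"):
        with open(path, "rb") as f:
            return pickle.load(f)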

+1

Source: https://habr.com/ru/post/1722786/

