I am currently writing a web crawler (using the Python Scrapy framework).
I recently had to implement a pause/resume system.
The solution I implemented is of the simplest kind: basically, it stores links when they get scheduled and marks them as "processed" once they actually are. Thus, I can fetch those links (obviously there is a bit more stored than just a URL: a depth value, the domain the link belongs to, etc.) when the spider resumes, and so far everything works well.
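To illustrate the idea, the bookkeeping boils down to something like the sketch below. This is not my real code, just a minimal version wired up as a Scrapy extension through the `request_scheduled` / `response_received` signals, with an in-memory dict standing in for the actual storage (all the names here are made up for the example):

```python
# Rough sketch of the "store when scheduled, mark when processed" bookkeeping,
# wired up as a Scrapy extension via signals. Names are illustrative only.
from urllib.parse import urlparse

from scrapy import signals


class PauseResumeBookkeeping:
    def __init__(self):
        # Stand-in for the real persistence layer (the MySQL table mentioned below):
        # maps url -> {'depth': ..., 'domain': ..., 'processed': ...}
        self.store = {}

    @classmethod
    def from_crawler(cls, crawler):
        # Enabled through the EXTENSIONS setting in settings.py.
        ext = cls()
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        return ext

    def request_scheduled(self, request, spider):
        # Record the link (plus depth and domain) as soon as it gets scheduled.
        self.store.setdefault(request.url, {
            'depth': request.meta.get('depth', 0),
            'domain': urlparse(request.url).netloc,
            'processed': False,
        })

    def response_received(self, response, request, spider):
        # Flag the link once it has actually been fetched.
        if request.url in self.store:
            self.store[request.url]['processed'] = True

    def pending_links(self):
        # Links to feed back to the spider when it resumes.
        return [(url, meta['depth'], meta['domain'])
                for url, meta in self.store.items()
                if not meta['processed']]
```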
Right now I'm just using a MySQL table to handle these storage actions, mainly for rapid prototyping.
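The table itself is nothing fancy; roughly something along these lines (schema and names simplified for illustration, accessed here through the MySQLdb driver):

```python
# Hypothetical MySQL-backed version of the same bookkeeping.
# Table name, columns and connection details are invented for illustration.
import MySQLdb

SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_links (
    url       VARCHAR(2048) NOT NULL,
    depth     INT           NOT NULL,
    domain    VARCHAR(255)  NOT NULL,
    processed TINYINT(1)    NOT NULL DEFAULT 0,
    PRIMARY KEY (url(255))
)
"""


class MySQLLinkStore:
    """MySQL-backed link bookkeeping for pausing/resuming a crawl."""

    def __init__(self, **connect_kwargs):
        self.conn = MySQLdb.connect(**connect_kwargs)
        cur = self.conn.cursor()
        cur.execute(SCHEMA)
        self.conn.commit()

    def mark_scheduled(self, url, depth, domain):
        # Store the link as soon as it gets scheduled.
        cur = self.conn.cursor()
        cur.execute(
            "INSERT IGNORE INTO crawl_links (url, depth, domain) VALUES (%s, %s, %s)",
            (url, depth, domain),
        )
        self.conn.commit()

    def mark_processed(self, url):
        # Flag the link once it has actually been crawled.
        cur = self.conn.cursor()
        cur.execute("UPDATE crawl_links SET processed = 1 WHERE url = %s", (url,))
        self.conn.commit()

    def pending_links(self):
        # Links to feed back to the spider on resume.
        cur = self.conn.cursor()
        cur.execute("SELECT url, depth, domain FROM crawl_links WHERE processed = 0")
        return cur.fetchall()
```

As the sketch shows, every scheduled or processed link ends up as its own INSERT/UPDATE and commit, which is where the write load described below comes from.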
Now I would like to know how I could optimize this, since I believe a database should not be the only option available here. By optimizing, I mean using a very simple and lightweight system, while still being able to handle a large amount of data written in a short time.
Currently, it should be able to handle crawls of a few dozen domains, which means storing several thousand links per second...
Thanks in advance for the suggestions.