Scrapy: Middleware / Pipeline single instance

I am building a local response cache, for which I created a Pipeline, because I need to store information about each item keyed by an identifier collected from the site.

Now I also need a Downloader Middleware, because depending on a previously saved identifier I do not want to scrape the site again with a new Request. So I intercept the Request before it is sent to the server, check whether the identifier already exists in my cache, and if so, return the same item from my cache.
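The interception pattern described here can be sketched in plain Python roughly like this (the class name `IdCacheMiddleware`, the `item_cache` attribute, the `store` helper, and the `meta`/`id` keys are all hypothetical names for illustration; a real Scrapy downloader middleware would receive `scrapy.Request` objects and return `None` to let the download proceed):

```python
class IdCacheMiddleware(object):
    """Sketch of a downloader-middleware-style cache check (hypothetical names)."""

    def __init__(self):
        self.item_cache = {}  # identifier -> previously scraped item

    def process_request(self, request, spider):
        # the request is assumed to carry the identifier in a 'meta' dict
        identifier = request.get('meta', {}).get('id')
        if identifier in self.item_cache:
            # identifier already seen: short-circuit and reuse the cached item
            return self.item_cache[identifier]
        return None  # not cached: let the request go out to the server

    def store(self, identifier, item):
        # called from the pipeline side once an item has been scraped
        self.item_cache[identifier] = item


mw = IdCacheMiddleware()
mw.store('abc', {'id': 'abc', 'title': 'cached page'})
print(mw.process_request({'meta': {'id': 'abc'}}, spider=None))  # the cached item
print(mw.process_request({'meta': {'id': 'new'}}, spider=None))  # None
```

The key point is that `process_request` and `store` must see the *same* `item_cache` dict, which is exactly the single-instance problem the question is about.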

As you can see, the Pipeline and the Middleware have to work together, so the separation does not seem very clean (they also share state that I want to keep in one place). But when I configure them in their respective settings:

```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.urlcache.CachePipelineMiddleware': 1,
}
ITEM_PIPELINES = {
    'myproject.urlcache.CachePipelineMiddleware': 800,
}
```

I get two different instances (a log message in the constructor shows that it is created twice).

How can I make sure that only one instance is created, without conflicting with the rest of the Pipeline and Downloader Middleware functionality of my project?

1 answer

I just realized that this is a simple Singleton problem: if the class is a singleton, Scrapy will work with the same instance for both the Pipeline and the Middleware.

First I create this Singleton metaclass:

```python
class Singleton(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]
```

And then, in the combined Pipeline / Middleware class, I set it as the metaclass:

```python
class CachePipelineMiddleware(object):
    # Python 2 syntax; in Python 3 declare the class as
    # class CachePipelineMiddleware(metaclass=Singleton) instead.
    __metaclass__ = Singleton

    def process_item(self, item, spider):
        # called by Scrapy as an item Pipeline
        return item

    def process_request(self, request, spider):
        # called by Scrapy as a Downloader Middleware;
        # returning None lets the request proceed normally
        return None
```
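A quick way to verify the effect (sketched here with the Python 3 `metaclass=` keyword, since `__metaclass__` is silently ignored on Python 3): instantiating the class twice, as Scrapy does once for `ITEM_PIPELINES` and once for `DOWNLOADER_MIDDLEWARES`, yields the same object, so any cache stored on it is shared.

```python
class Singleton(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]


class CachePipelineMiddleware(metaclass=Singleton):  # Python 3 syntax
    def __init__(self):
        self.cache = {}  # shared between pipeline and middleware roles


# Both "instantiations" return the one shared instance:
pipeline_side = CachePipelineMiddleware()
middleware_side = CachePipelineMiddleware()
print(pipeline_side is middleware_side)  # True
```

Note that this makes the instance global to the process; if you ever run several spiders in one process that must not share the cache, a singleton keyed only by class will not separate them.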

Source: https://habr.com/ru/post/1274448/
