Is there a way to use a separate pipeline for each spider?

I want to scrape web pages from a different domain, which means I have to run another spider with the scrapy crawl myspider command. However, the data has to go through different pipeline logic before it is put into the database, because the content of those pages is different. But every spider runs through all of the pipelines defined in settings.py. Is there an elegant way to use separate pipelines for each spider?

+4
3 answers

The ITEM_PIPELINES setting is defined globally for all the spiders in the project during engine startup; it cannot be changed per spider on the fly.

Here are a few options:

  • Change the pipeline code. Skip or continue processing the items returned by a spider in the process_item method of your pipeline, for example:

      def process_item(self, item, spider):
          if spider.name not in ['spider1', 'spider2']:
              return item
          # process item
  • Change the way you start crawling. Use a script that takes the spider name as a parameter and overrides the ITEM_PIPELINES setting before calling crawler.configure() (a sketch follows this list).
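The answer above refers to the old crawler.configure() API; on a reasonably recent Scrapy the same idea looks roughly like the sketch below, where the spider-to-pipeline mapping and the module paths are hypothetical placeholders:

    import sys

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Hypothetical mapping from spider name to the pipelines it should use.
    SPIDER_PIPELINES = {
        'spider1': {'myproject.pipelines.Pipeline1': 300},
        'spider2': {'myproject.pipelines.Pipeline2': 300},
    }

    spider_name = sys.argv[1]

    settings = get_project_settings()
    # Override ITEM_PIPELINES before the engine starts, based on the spider name.
    settings.set('ITEM_PIPELINES', SPIDER_PIPELINES.get(spider_name, {}))

    process = CrawlerProcess(settings)
    process.crawl(spider_name)  # looks the spider up by name in the project
    process.start()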

Hope this helps.

+9

Here is a slightly better version of the answer above. It is better because it lets you selectively enable pipelines for different spiders more easily than hard-coding not in ['spider1', 'spider2'] in the pipeline itself.

In your spider class add:

    # start_urls = ...
    pipelines = ['pipeline1', 'pipeline2']  # selectively turn on pipelines within spiders
    # ...

Then in each pipeline, use getattr to check that attribute. Add:

    class pipeline1(object):
        def process_item(self, item, spider):
            # Default to an empty list so spiders that define no `pipelines`
            # attribute are skipped instead of raising AttributeError.
            if 'pipeline1' not in getattr(spider, 'pipelines', []):
                return item
            # ...keep going as normal
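Note that with this approach every pipeline still has to be registered globally; the spider's pipelines attribute only decides which of them actually act on an item. A minimal sketch of the settings entry, assuming a Scrapy version where ITEM_PIPELINES is a dict, with hypothetical module paths:

    # settings.py -- both pipelines stay registered; each one checks the
    # spider's `pipelines` attribute itself (paths are hypothetical).
    ITEM_PIPELINES = {
        'myproject.pipelines.pipeline1': 300,
        'myproject.pipelines.pipeline2': 400,
    }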
+5

A more reliable solution; I can't remember where I found it, but a Scrapy developer suggested it somewhere. With this method a pipeline can still run for all spiders simply by leaving the wrapper off it, and you don't have to duplicate the logic that checks whether a pipeline should be used.

The wrapper:

    import functools

    def check_spider_pipeline(process_item_method):
        """
        This wrapper makes it so pipelines can be turned on and off
        at a spider level.
        """
        @functools.wraps(process_item_method)
        def wrapper(self, item, spider):
            if self.__class__ in spider.pipeline:
                return process_item_method(self, item, spider)
            else:
                return item
        return wrapper

Pipeline use:

    @check_spider_pipeline
    def process_item(self, item, spider):
        # ...
        return item

Spider use:

    pipeline = {some.pipeline, some.other.pipeline, ...}
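Putting it together: the wrapper tests self.__class__ in spider.pipeline, so the attribute must hold the pipeline classes themselves, not name strings. A minimal sketch with hypothetical module and class names:

    import scrapy

    from myproject.pipelines import DatabasePipeline  # hypothetical import

    class MySpider(scrapy.Spider):
        name = 'myspider'
        # Classes, not strings: the wrapper checks
        # `self.__class__ in spider.pipeline`.
        pipeline = {DatabasePipeline}

        def parse(self, response):
            pass  # normal parsing logic here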
+1

Source: https://habr.com/ru/post/1488867/

