Running scrapy from inside a Python script - CSV exporter not working

My scraper works fine when I run it from the command line, but when I try to run it from a Python script (using the Twisted-based method described here), it does not output the two CSV files that it usually produces. I have a pipeline that creates and populates these files: one uses CsvItemExporter() and the other uses writeCsvFile(). Here is the code:

    from os import getcwd

    from scrapy import signals
    from scrapy.contrib.exporter import CsvItemExporter  # scrapy.exporters in newer versions

    from SiteCrawler.spiders.myfuncs import writeCsvFile  # custom helper


    class CsvExportPipeline(object):
        def __init__(self):
            self.files = {}

        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            nodes = open('%s_nodes.csv' % spider.name, 'w+b')
            self.files[spider] = nodes
            self.exporter1 = CsvItemExporter(nodes, fields_to_export=['url', 'name', 'screenshot'])
            self.exporter1.start_exporting()
            self.edges = []
            self.edges.append(['Source', 'Target', 'Type', 'ID', 'Label', 'Weight'])
            self.num = 1

        def spider_closed(self, spider):
            self.exporter1.finish_exporting()
            file = self.files.pop(spider)
            file.close()
            writeCsvFile(getcwd() + r'\edges.csv', self.edges)

        def process_item(self, item, spider):
            self.exporter1.export_item(item)
            for url in item['links']:
                self.edges.append([item['url'], url, 'Directed', self.num, '', 1])
                self.num += 1
            return item
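
writeCsvFile() is a custom helper whose definition is not shown here; presumably it just dumps a list of rows to a CSV file, something like this sketch:

    # Sketch of what writeCsvFile() presumably does (not the actual code):
    import csv

    def writeCsvFile(path, rows):
        with open(path, 'wb') as f:  # 'wb' matches the Python 2 csv idiom above
            csv.writer(f).writerows(rows)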

Here is my file structure:

    SiteCrawler/                       # the CSVs are normally created in this folder
        runspider.py                   # this is the script that runs the scraper
        scrapy.cfg
        SiteCrawler/
            __init__.py
            items.py
            pipelines.py
            screenshooter.py
            settings.py
            spiders/
                __init__.py
                myfuncs.py
                sitecrawler_spider.py

The scraper seems to work normally in every other way. The command-line output shows that the expected number of pages was crawled, and the spider appears to have finished normally. I get no error messages.

---- EDIT: ----

Inserting print statements and even syntax errors into the pipeline has no effect, so the pipeline seems to be ignored entirely. Why could this be?

Here is the code for the script that starts the scraper (runspider.py):

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.xlib.pydispatch import dispatcher
    import logging

    from SiteCrawler.spiders.sitecrawler_spider import MySpider

    def stop_reactor():
        reactor.stop()

    dispatcher.connect(stop_reactor, signal=signals.spider_closed)

    spider = MySpider()
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start(loglevel=logging.DEBUG)
    log.msg('Running reactor...')
    reactor.run()  # the script will block here until the spider is closed
    log.msg('Reactor stopped.')

2 answers

Replace "from scrapy.settings import settings" with "from scrapy.utils.project import get_project_settings as Settings" fixed problem.

The solution was found here. No explanation of the fix was given there.

alecxe provided an example of how to run Scrapy from within a Python script.

EDIT:

After reading alecxe's post in more detail, I now see the difference between "from scrapy.settings import Settings" and "from scrapy.utils.project import get_project_settings as Settings". The latter lets you use your project's settings file rather than the default settings. See alecxe's post (linked above) for details.
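
For reference, a minimal sketch of the corrected runspider.py, assuming the same old-style Crawler API used in the question. The key change is that get_project_settings() actually loads the project's settings.py (including ITEM_PIPELINES), whereas Settings() gave only the defaults, so the CSV pipeline never ran:

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings
    from scrapy.xlib.pydispatch import dispatcher
    import logging

    from SiteCrawler.spiders.sitecrawler_spider import MySpider

    def stop_reactor():
        reactor.stop()

    dispatcher.connect(stop_reactor, signal=signals.spider_closed)

    spider = MySpider()
    # get_project_settings() reads scrapy.cfg and loads the project's
    # settings.py, so ITEM_PIPELINES (and the CSV pipeline) take effect
    crawler = Crawler(get_project_settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start(loglevel=logging.DEBUG)
    reactor.run()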


In my project, I call the Scrapy code from another Python script using os.system:

    import os

    os.chdir('/home/admin/source/scrapy_test')
    command = "scrapy crawl test_spider -s FEED_URI='file:///home/admin/scrapy/data.csv' -s LOG_FILE='/home/admin/scrapy/scrapy_test.log'"
    return_code = os.system(command)
    print('done')
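
A subprocess-based equivalent (a sketch; the same spider name and paths as above are assumed) avoids changing the parent process's working directory and makes the return code easier to check:

    import subprocess

    return_code = subprocess.call(
        ["scrapy", "crawl", "test_spider",
         "-s", "FEED_URI=file:///home/admin/scrapy/data.csv",
         "-s", "LOG_FILE=/home/admin/scrapy/scrapy_test.log"],
        cwd='/home/admin/source/scrapy_test',  # replaces os.chdir
    )
    print('done' if return_code == 0 else 'failed with code %d' % return_code)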
