How to save data from a Scrapy crawler into a variable?

I am currently building a web application designed to display data collected by a Scrapy spider. The user makes a request, the spider crawls a website and then returns the data to the application so it can be displayed. I would like to get the data directly from the scraper, without relying on an intermediate .csv or .json file. Something like:

from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

url = 'www.example.com'
spider = MySpider()
crawler = CrawlerProcess()
crawler.crawl(spider, start_urls=[url])
crawler.start()
data = crawler.data  # this bit
3 answers

This is not so simple, because Scrapy is non-blocking and works in an event loop: it uses the Twisted event loop, and the Twisted event loop is not restartable, so you cannot write crawler.start(); data = crawler.data. After crawler.start() the process runs forever, calling the registered callbacks, until it is killed or terminated.
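As an aside (not part of the original answer): if you only need a single crawl from a plain, blocking script rather than a long-running application, one common workaround is to collect items in a list through the item_scraped signal before starting the process. A minimal sketch, assuming the MySpider from the question; exact behaviour can vary between Scrapy versions:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.signalmanager import dispatcher

from scraper.spiders import MySpider  # placeholder spider from the question

items = []

def collect_item(item, response, spider):
    # called by Scrapy once for every item the spider yields
    items.append(item)

# connect the handler before starting the crawl
dispatcher.connect(collect_item, signal=signals.item_scraped)

process = CrawlerProcess()
process.crawl(MySpider, start_urls=['http://www.example.com'])
process.start()  # blocks here until the crawl has finished
print(items)

Inside a long-running web application, however, blocking on process.start() like this is not an option, which is what the event-loop based approach below addresses.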

These answers may be relevant:

If you use an event loop in your application (for example, if you have a Twisted or Tornado web server), then it is possible to get the data from a crawl without storing it on disk. The idea is to listen to the item_scraped signal. I use the following helper to make it nicer:

import collections

from twisted.internet.defer import Deferred
from scrapy.crawler import Crawler
from scrapy import signals


def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
    """
    Start a crawl and return an object (ItemCursor instance)
    which allows to retrieve scraped items and wait for items
    to become available.

    Example:

    .. code-block:: python

        @inlineCallbacks
        def f():
            runner = CrawlerRunner()
            async_items = scrape_items(runner, my_spider)
            while (yield async_items.fetch_next):
                item = async_items.next_item()
                # ...
            # ...

    This convoluted way to write a loop should become unnecessary
    in Python 3.5 because of ``async for``.
    """
    crawler = crawler_runner.create_crawler(crawler_or_spidercls)
    d = crawler_runner.crawl(crawler, *args, **kwargs)
    return ItemCursor(d, crawler)


class ItemCursor(object):
    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d
        self.crawler = crawler

        crawler.signals.connect(self._on_item_scraped, signals.item_scraped)

        crawl_d.addCallback(self._on_finished)
        crawl_d.addErrback(self._on_error)

        self.closed = False
        self._items_available = Deferred()
        self._items = collections.deque()

    def _on_item_scraped(self, item):
        self._items.append(item)
        self._items_available.callback(True)
        self._items_available = Deferred()

    def _on_finished(self, result):
        self.closed = True
        self._items_available.callback(False)

    def _on_error(self, failure):
        self.closed = True
        self._items_available.errback(failure)

    @property
    def fetch_next(self):
        """
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
        asynchronously retrieve the next item, waiting for an item to be
        crawled if necessary. Resolves to ``False`` if the crawl is finished,
        otherwise :meth:`next_item` is guaranteed to return an item
        (a dict or a scrapy.Item instance).
        """
        if self.closed:
            # crawl is finished
            d = Deferred()
            d.callback(False)
            return d

        if self._items:
            # result is ready
            d = Deferred()
            d.callback(True)
            return d

        # We're active, but item is not ready yet. Return a Deferred which
        # resolves to True if item is scraped or to False if crawl is stopped.
        return self._items_available

    def next_item(self):
        """Get a document from the most recently fetched batch, or ``None``.
        See :attr:`fetch_next`.
        """
        if not self._items:
            return None
        return self._items.popleft()

The API is inspired by motor, a MongoDB driver for asynchronous frameworks. Using scrape_items you can get items from Twisted or Tornado callbacks as soon as they are scraped, in a way similar to how you fetch items from a MongoDB query.
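For illustration, here is a minimal sketch of driving the scrape_items helper above from Twisted with inlineCallbacks. MySpider, its module path and the URL are placeholders, error handling is omitted, and newer Scrapy versions may require the TWISTED_REACTOR setting to match the reactor you import:

from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from scrapy.crawler import CrawlerRunner

from scraper.spiders import MySpider  # placeholder spider


@inlineCallbacks
def consume_items():
    runner = CrawlerRunner()
    cursor = scrape_items(runner, MySpider, start_urls=['http://www.example.com'])
    while (yield cursor.fetch_next):
        # each item becomes available here as soon as the spider yields it
        print(cursor.next_item())
    reactor.stop()  # the crawl has finished, stop the event loop


reactor.callWhenRunning(consume_items)
reactor.run()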


You can pass a variable to the spider as an attribute and store the data in it.

Of course, you need to add the attribute to the __init__ method of your spider class.

from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

url = 'www.example.com'
crawler = CrawlerProcess()
data = []
# pass the shared list to the spider as a keyword argument
crawler.crawl(MySpider, start_urls=[url], data=data)
crawler.start()
print(data)
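For completeness, since the answer does not show the spider side, here is a minimal sketch of what MySpider might look like so that the shared list actually receives the scraped items; the data keyword and the parsed fields are assumptions:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, data=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # keep a reference to the caller's list; anything appended here
        # is visible to the calling code once the crawl has finished
        self.data = data if data is not None else []

    def parse(self, response):
        item = {'url': response.url, 'title': response.css('title::text').get()}
        self.data.append(item)
        yield item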

It may be too late, but for the benefit of others: you can pass a callback function to the spider and then call that function to return your data, like this:

The dummy spider we are going to use:

from scrapy import Spider


class Trial(Spider):
    name = 'trial'
    start_urls = ['']  # put a real start URL here

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # the calling code passes its callback in via the 'args' keyword
        self.output_callback = kwargs.get('args').get('callback')

    def parse(self, response):
        pass

    def close(self, spider, reason):
        # called when the spider closes; hand the result back to the caller
        self.output_callback(['Hi, This is the output.'])

Custom class with callback:

from scrapy.crawler import CrawlerProcess
from scrapyapp.spiders.trial_spider import Trial


class CustomCrawler:

    def __init__(self):
        self.output = None
        self.process = CrawlerProcess(settings={'LOG_ENABLED': False})

    def yield_output(self, data):
        self.output = data

    def crawl(self, cls):
        self.process.crawl(cls, args={'callback': self.yield_output})
        self.process.start()


def crawl_static(cls):
    crawler = CustomCrawler()
    crawler.crawl(cls)
    return crawler.output

Then you can do:

out = crawl_static(Trial)
print(out)
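One caveat worth adding (an observation, not part of the answer): CrawlerProcess.start() runs the Twisted reactor, which cannot be restarted within the same Python process, so crawl_static can only be called once per process:

out = crawl_static(Trial)    # first call: the reactor starts, the crawl runs, the reactor stops
# out = crawl_static(Trial)  # a second call raises twisted.internet.error.ReactorNotRestartable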

Source: https://habr.com/ru/post/1260066/

