This is not so simple, because Scrapy is non-blocking and runs in an event loop; it uses the Twisted reactor, and the Twisted reactor cannot be restarted, so you cannot write `crawler.start(); data = crawler.data`: after `crawler.start()` the process runs forever, calling the registered callbacks until it is killed or terminated.
These answers may be helpful:
If you already use an event loop in your application (for example, if you have a Twisted or Tornado web server), you can get the data from the crawl without saving it to disk. The idea is to listen for the `item_scraped` signal. I use the following helper to make this nicer:
```python
import collections

from twisted.internet.defer import Deferred
from scrapy.crawler import Crawler
from scrapy import signals


def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
    """
    Start a crawl and return an object (ItemCursor instance)
    which allows to retrieve scraped items and wait for items
    to become available.

    Example:

    .. code-block:: python

        @inlineCallbacks
        def f():
            runner = CrawlerRunner()
            async_items = scrape_items(runner, my_spider)
            while (yield async_items.fetch_next):
                item = async_items.next_item()
                # ...
            # ...

    This convoluted way to write a loop should become unnecessary
    in Python 3.5 because of ``async for``.
    """
    crawler = crawler_runner.create_crawler(crawler_or_spidercls)
    d = crawler_runner.crawl(crawler, *args, **kwargs)
    return ItemCursor(d, crawler)


class ItemCursor(object):
    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d
        self.crawler = crawler

        crawler.signals.connect(self._on_item_scraped, signals.item_scraped)

        crawl_d.addCallback(self._on_finished)
        crawl_d.addErrback(self._on_error)

        self.closed = False
        self._items_available = Deferred()
        self._items = collections.deque()

    def _on_item_scraped(self, item):
        self._items.append(item)
        self._items_available.callback(True)
        self._items_available = Deferred()

    def _on_finished(self, result):
        self.closed = True
        self._items_available.callback(False)

    def _on_error(self, failure):
        self.closed = True
        self._items_available.errback(failure)

    @property
    def fetch_next(self):
        """
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
        asynchronously retrieve the next item, waiting for an item to be
        crawled if necessary.

        Resolves to ``False`` if the crawl is finished, otherwise
        :meth:`next_item` is guaranteed to return an item
        (a dict or a scrapy.Item instance).
        """
        if self.closed:
            # The original snippet was cut off at this point; the rest of
            # fetch_next and next_item below are reconstructed to match the
            # docstrings above, so treat them as a best-effort completion.
            # Crawl is finished: resolve immediately to False.
            d = Deferred()
            d.callback(False)
            return d
        return self._items_available

    def next_item(self):
        """Get the next scraped item, or ``None`` if none is queued.
        See :attr:`fetch_next`.
        """
        if not self._items:
            return None
        return self._items.popleft()
```
The API is inspired by motor, the MongoDB driver for asynchronous frameworks. Using `scrape_items`, you can get items from Twisted or Tornado callbacks as soon as they are scraped, similar to how you fetch documents from a MongoDB query.
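For completeness, here is a minimal sketch of how the helper might be driven from a script that owns the Twisted reactor. The spider class `MySpider` and the import paths for it and for `scrape_items` are illustrative assumptions, not part of the original answer:

```python
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from myproject.spiders import MySpider      # hypothetical spider
from myproject.utils import scrape_items    # the helper defined above


@inlineCallbacks
def print_items():
    configure_logging()
    runner = CrawlerRunner()
    async_items = scrape_items(runner, MySpider)
    # fetch_next resolves to True when an item is available and to False
    # once the crawl has finished, so this loop drains items as they arrive.
    while (yield async_items.fetch_next):
        item = async_items.next_item()
        print(item)
    reactor.stop()


reactor.callWhenRunning(print_items)
reactor.run()
```

In a Tornado application the same loop works with `tornado.gen.coroutine` instead of `inlineCallbacks`, since `fetch_next` just returns a Deferred.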