I am new to Scrapy. I am looking into using it to crawl an entire site for links and output the items to multiple JSON files, so that I can then upload them to Amazon CloudSearch for indexing. Is it possible to split the items across several files instead of ending up with one giant file at the end? From what I have read, Item Exporters can only write one file per spider, and I am only using a single CrawlSpider for this task. It would be nice if I could set a limit on the number of items included in each file, for example 500 or 1000.
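To make the goal concrete, here is a rough, untested sketch of the kind of item pipeline I imagine could do this (the SplitJsonPipeline name, the file-naming pattern, and the 500-item limit are just placeholders on my part):

from scrapy.exporters import JsonItemExporter

class SplitJsonPipeline(object):
    """Sketch: export items to numbered JSON files, starting a new file every ITEMS_PER_FILE items."""
    ITEMS_PER_FILE = 500  # placeholder limit

    def open_spider(self, spider):
        self.file_index = 0
        self.item_count = 0
        self._start_new_file()

    def _start_new_file(self):
        # JsonItemExporter expects a file opened in binary mode
        self.file = open('items-%03d.json' % self.file_index, 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def _finish_current_file(self):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        self.item_count += 1
        if self.item_count >= self.ITEMS_PER_FILE:
            # Close the current file and roll over to the next one
            self._finish_current_file()
            self.file_index += 1
            self.item_count = 0
            self._start_new_file()
        return item

    def close_spider(self, spider):
        self._finish_current_file()

I assume something like this would also need to be enabled through ITEM_PIPELINES in settings.py, but I am not sure whether a custom pipeline is the right approach or whether the built-in exporters/feed exports can already do this.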
Here is the code I have set up so far (based on the dmoz.org example from the tutorial):
dmoz_spider.py
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
items.py
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Thanks for the help.