I am new to Scrapy. I am looking into using it to crawl an entire site for links and output the items to multiple JSON files, so that I can then upload them to Amazon CloudSearch for indexing. Is it possible to split the items across several files instead of ending up with one giant file at the end? From what I have read, Item Exporters can only write one file per spider, and I am only using a single CrawlSpider for this task. It would be nice if I could set a limit on the number of items included in each file, for example 500 or 1000.
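To make the goal concrete, here is a rough, untested sketch of the kind of item pipeline I imagine could do this (the SplitJsonPipeline name, the file-naming pattern, and the 500-item limit are just placeholders on my part):

from scrapy.exporters import JsonItemExporter

class SplitJsonPipeline(object):
    """Sketch: export items to numbered JSON files, starting a new file every ITEMS_PER_FILE items."""
    ITEMS_PER_FILE = 500  # placeholder limit

    def open_spider(self, spider):
        self.file_index = 0
        self.item_count = 0
        self._start_new_file()

    def _start_new_file(self):
        # JsonItemExporter expects a file opened in binary mode
        self.file = open('items-%03d.json' % self.file_index, 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def _finish_current_file(self):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        self.item_count += 1
        if self.item_count >= self.ITEMS_PER_FILE:
            # Close the current file and roll over to the next one
            self._finish_current_file()
            self.file_index += 1
            self.item_count = 0
            self._start_new_file()
        return item

    def close_spider(self, spider):
        self._finish_current_file()

I assume something like this would also need to be enabled through ITEM_PIPELINES in settings.py, but I am not sure whether a custom pipeline is the right approach or whether the built-in exporters/feed exports can already do this.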
Here is the code I have set up so far (based on the dmoz.org example from the tutorial):
dmoz_spider.py
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
items.py
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Thanks for the help.