Scrapy: overwrite the JSON output file instead of appending to it

Is there a way to overwrite the specified output file instead of appending to it?

Example:

    scrapy crawl myspider -o "/path/to/json/my.json" -t json

Running this a second time appends to my.json instead of overwriting it.

+5
4 answers
Write the feed to stdout with -o - and let the shell redirection truncate the file on each run:

    scrapy crawl myspider -t json --nolog -o - > "/path/to/json/my.json"
+9

To solve this problem, I created a subclass of scrapy.extensions.feedexport.FileFeedStorage in the myproject directory.

This is my customexport.py:

 """Custom Feed Exports extension.""" import os from scrapy.extensions.feedexport import FileFeedStorage class CustomFileFeedStorage(FileFeedStorage): """ A File Feed Storage extension that overwrites existing files. See: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/feedexport.py#L79 """ def open(self, spider): """Return the opened file.""" dirname = os.path.dirname(self.path) if dirname and not os.path.exists(dirname): os.makedirs(dirname) # changed from 'ab' to 'wb' to truncate file when it exists return open(self.path, 'wb') 

Then I added the following to settings.py (see https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-storages-base):

    FEED_STORAGES_BASE = {
        '': 'myproject.customexport.CustomFileFeedStorage',
        'file': 'myproject.customexport.CustomFileFeedStorage',
    }

With this in place, the output file is overwritten on every run instead of appended to.

+6

This is an old, well-known "problem" of Scrapy: feed exports append by design, so every time you start a crawl and do not want to keep the results of previous runs, you have to delete the file first. The idea is that you may want to crawl different sites, or the same site at different points in time, and appending keeps you from accidentally losing results you have already collected, which would be bad.

The solution is to write your own item pipeline that opens the target file with mode 'w' instead of 'a' (see the sketch below).

To see how to write such a pipeline, see the docs: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline (specifically, for exporting JSON: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file)
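Here is a minimal sketch of such a pipeline, adapted from the JSON-writer example in the docs linked above; the class name, output path, and pipeline priority are placeholder assumptions you would adapt to your own project:

    # pipelines.py -- JSON item pipeline that truncates its output on every crawl
    import json


    class JsonOverwritePipeline:
        def open_spider(self, spider):
            # 'w' truncates the file when the crawl starts; 'a' would append
            self.file = open('items.json', 'w', encoding='utf-8')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # one JSON object per line (JSON Lines), as in the docs example
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

Enable it in settings.py:

    ITEM_PIPELINES = {
        'myproject.pipelines.JsonOverwritePipeline': 300,
    }

Note that, like the docs example, this writes JSON Lines rather than a single JSON array.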

+2

Since the accepted answer gave me problems with invalid JSON, this might work:

 find "/path/to/json/" -name "my.json" -exec rm {} \; && scrapy crawl myspider -t json -o "/path/to/json/my.json" 
0

Source: https://habr.com/ru/post/1233752/

