Scrapy: overwrite the JSON output file instead of appending to it

Is there a way to overwrite the specified output file instead of appending to it?

Example:

    scrapy crawl myspider -o "/path/to/json/my.json" -t json

Running this a second time appends to my.json instead of overwriting it.

+5
4 answers
Write the feed to stdout with -o - and let the shell redirection truncate the file on each run:

    scrapy crawl myspider -t json --nolog -o - > "/path/to/json/my.json"
+9

To solve this problem, I created a subclass of scrapy.extensions.feedexport.FileFeedStorage in the myproject directory.

This is my customexport.py:

 """Custom Feed Exports extension.""" import os from scrapy.extensions.feedexport import FileFeedStorage class CustomFileFeedStorage(FileFeedStorage): """ A File Feed Storage extension that overwrites existing files. See: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/feedexport.py#L79 """ def open(self, spider): """Return the opened file.""" dirname = os.path.dirname(self.path) if dirname and not os.path.exists(dirname): os.makedirs(dirname) # changed from 'ab' to 'wb' to truncate file when it exists return open(self.path, 'wb') 

Then I added the following to settings.py (see https://doc.scrapy.org/en/1.2/topics/feed-exports.html#feed-storages-base):

    FEED_STORAGES_BASE = {
        '': 'myproject.customexport.CustomFileFeedStorage',
        'file': 'myproject.customexport.CustomFileFeedStorage',
    }

With this in place, the output file is overwritten on every run instead of appended to.

+6

This is an old, well-known "problem" of Scrapy: feed exports append by design, so every time you start a crawl and do not want to keep the results of previous runs, you have to delete the file first. The idea is that you may want to crawl different sites, or the same site at different points in time, and appending keeps you from accidentally losing results you have already collected, which would be bad.

The solution is to write your own item pipeline that opens the target file with mode 'w' instead of 'a' (see the sketch below).

To see how to write such a pipeline, see the docs: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline (specifically, for exporting JSON: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file)
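Here is a minimal sketch of such a pipeline, adapted from the JSON-writer example in the docs linked above; the class name, output path, and pipeline priority are placeholder assumptions you would adapt to your own project:

    # pipelines.py -- JSON item pipeline that truncates its output on every crawl
    import json


    class JsonOverwritePipeline:
        def open_spider(self, spider):
            # 'w' truncates the file when the crawl starts; 'a' would append
            self.file = open('items.json', 'w', encoding='utf-8')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # one JSON object per line (JSON Lines), as in the docs example
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

Enable it in settings.py:

    ITEM_PIPELINES = {
        'myproject.pipelines.JsonOverwritePipeline': 300,
    }

Note that, like the docs example, this writes JSON Lines rather than a single JSON array.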

+2

Since the accepted answer gave me problems with invalid JSON, this might work:

 find "/path/to/json/" -name "my.json" -exec rm {} \; && scrapy crawl myspider -t json -o "/path/to/json/my.json" 
0

Source: https://habr.com/ru/post/1233752/

