I have a Scrapy (version 1.0.3) spider in which I extract some data from a web page and also download a file, like this (simplified):
    def extract_data(self, response):
        title = response.xpath('//html/head/title/text()').extract()[0].strip()
        my_item = MyItem()
        my_item['title'] = title
        file_url = response.xpath('...get url of file...')
        file_urls = [file_url]
In pipelines.py I just override FilesPipeline to change the file name:
    from scrapy.pipelines.files import FilesPipeline

    class CustomFilesPipeline(FilesPipeline):
        def file_path(self, request, response=None, info=None):
            filename = format_filename(request.url)
            return filename
In items.py I have:
    class MyItem(scrapy.Item):
        title = scrapy.Field()

    class FileItem(scrapy.Item):
        file_urls = scrapy.Field()
        files = scrapy.Field()
In settings.py I have:
    ITEM_PIPELINES = {
        'myscraping.pipelines.CustomFilesPipeline': 100
    }
Now, in the CSV output file, I get something like this:
    title1
    title2
    ,
    ,
    title3
    etc.
It seems that the empty lines (containing only a comma) correspond to the downloaded files, and I would like to know how to prevent such lines from appearing in the CSV output file. (The files themselves are saved to a folder correctly.)
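To illustrate what I suspect is happening, here is a minimal standalone sketch that uses only Python 3's standard csv module, not Scrapy's own exporter. It assumes the exporter writes one row per item and leaves missing columns blank; the helper export_rows, the second column "author", and the example URL are hypothetical and exist only for this illustration.

```python
import csv
import io


def export_rows(items, fields, skip_empty=False):
    """Write one CSV row per item, leaving missing columns blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fields,
        restval="",             # missing columns become empty strings
        extrasaction="ignore",  # keys outside `fields` are silently dropped
        lineterminator="\n",
    )
    for item in items:
        # An item with none of the exported fields would become a row
        # made of separators only, so optionally skip it.
        if skip_empty and not any(item.get(f) for f in fields):
            continue
        writer.writerow(item)
    return buf.getvalue()


# Two "data" items with a "file" item in between (all values made up).
items = [
    {"title": "title1", "author": "a1"},
    {"file_urls": ["http://example.com/file.pdf"]},
    {"title": "title2", "author": "a2"},
]

print(export_rows(items, ["title", "author"]))
print(export_rows(items, ["title", "author"], skip_empty=True))
```

The first call emits a line containing only a comma for the file item, and the second call omits it. If Scrapy behaves the same way, the fix would be to keep the file items out of the exported feed rather than to change the feed settings.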
In the Scrapy settings, I found FEED_STORE_EMPTY (False by default, meaning empty feeds are not exported), but I assume this applies to whole feeds and not to individual items such as the file items.
I feel that this has something to do with pipelines, but I cannot figure out how to do it.
Any help would be appreciated.