When saving a scraped item and a file, Scrapy inserts empty lines into the CSV output file

I have a Scrapy (version 1.0.3) spider in which I extract some data from a web page and also download a file, like this (simplified):

    def extract_data(self, response):
        title = response.xpath('//html/head/title/text()').extract()[0].strip()
        my_item = MyItem()
        my_item['title'] = title
        file_url = response.xpath('...get url of file...')
        file_urls = [file_url]  # here there can be more urls, so I'm storing like a list
        fi = FileItem()
        fi['file_urls'] = file_urls
        yield my_item
        yield fi

In pipelines.py I just override FilesPipeline to change the file name:

    from scrapy.pipelines.files import FilesPipeline

    class CustomFilesPipeline(FilesPipeline):
        def file_path(self, request, response=None, info=None):
            filename = format_filename(request.url)
            return filename

In items.py I have:

    class MyItem(scrapy.Item):
        title = scrapy.Field()

    class FileItem(scrapy.Item):
        file_urls = scrapy.Field()
        files = scrapy.Field()

In settings.py I have:

    ITEM_PIPELINES = {
        'myscraping.pipelines.CustomFilesPipeline': 100
    }

Now in the CSV output file I get something like this:

    title1
    title2
    ,
    ,
    title3
    etc.

It seems that the empty lines (containing only a comma) correspond to the downloaded files, and I would like to know how to prevent such lines from appearing in the CSV output file. (The files themselves are saved in a folder.)
In the Scrapy settings I found FEED_STORE_EMPTY (false by default, i.e. empty feeds should not be exported), but I assume this does not apply to files.
I have a feeling this has something to do with pipelines, but I cannot figure out how to do it.
Any help would be appreciated.

1 answer

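Posting the answer here: yield a single item that carries both the title and the file URLs, instead of a separate FileItem: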

    def extract_data(self, response):
        title = response.xpath('//html/head/title/text()').extract()[0].strip()
        my_item = MyItem()
        my_item['title'] = title
        file_url = response.xpath('...get url of file...')
        my_item['file_urls'] = [file_url]
        yield my_item
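
Why this helps, as far as I can tell: the CSV feed gets one row per yielded item, so a FileItem that has no title shows up as a row of empty cells. For the snippet above to work, MyItem would presumably also need file_urls and files fields, since FilesPipeline reads the former and stores its results in the latter by default. The sketch below is not Scrapy code. It only imitates the one-row-per-item behaviour with the standard csv module, using plain dicts in place of items. The names export_rows, SEPARATE and MERGED and the example.com URLs are made up for illustration, and exporting only a title column is an assumption.

```python
import csv
import io


def export_rows(items, columns=("title",)):
    """Write one CSV row per item, then read the rows back as lists.

    This imitates a per-item CSV exporter; it is not Scrapy's implementation.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(columns),
        restval="",             # fields missing from an item become empty cells
        extrasaction="ignore",  # fields outside `columns` are left out
    )
    for item in items:
        writer.writerow(item)
    buffer.seek(0)
    return list(csv.reader(buffer))


# Question's approach: data item and file item are yielded separately.
SEPARATE = [
    {"title": "title1"},
    {"file_urls": ["http://example.com/a.pdf"]},  # hypothetical URL
    {"title": "title2"},
    {"file_urls": ["http://example.com/b.pdf"]},  # hypothetical URL
]

# Answer's approach: one item carries both the title and the file URLs.
MERGED = [
    {"title": "title1", "file_urls": ["http://example.com/a.pdf"]},
    {"title": "title2", "file_urls": ["http://example.com/b.pdf"]},
]


def count_blank_rows(rows):
    """Count rows in which every cell is empty."""
    return sum(1 for row in rows if not any(row))


print(count_blank_rows(export_rows(SEPARATE)))
print(count_blank_rows(export_rows(MERGED)))
```

With separate items, every file item produces a row with nothing in it; with the merged item, each row has a title. If the file columns are unwanted in the CSV, the FEED_EXPORT_FIELDS setting should be able to limit which fields are exported.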

Source: https://habr.com/ru/post/1233701/

