Scrapy - getting duplicate items with JOBDIR

Scrapy's JOBDIR setting makes crawls resumable, as described here:

http://doc.scrapy.org/en/latest/topics/jobs.html

I am trying to execute the crawl command as follows:

scrapy crawl myspider -o out.csv -t csv -s JOBDIR=./jobs/run-1

While it is still running, I shut it down gracefully by pressing CTRL-C, then run the same command again to resume it. I can confirm from the terminal output that it resumes the crawl:

[myspider] INFO: Resuming crawl (74 requests scheduled)

But when I look at the output CSV file, I see that there are duplicate items like this:

name,email
Alice,alice@example.com
Bob,bob@example.com
...
name,email            <- duplicated header!
Bob,bob@example.com   <- duplicated row!
...

Is this normal? I wonder whether it is OK to use the -o option and JOBDIR in the same command. If not, how do I export the scraped items?

By the way, I am using Scrapy 0.22.1.

Thanks!


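Yes, this is to be expected. If you look at the source code of scrapy, and in particular at CsvItemExporter, you will see that it keeps no state about stopping and resuming a crawl. The exporter handles the headers with two flags. The first one says whether to include headers at all: include_headers_line. The second one: _headers_not_written, stops the headers from being written again for every new item, so they are written only before the first item of a session. Both flags are reset every time the crawler is started again, however, and the exporter carries no information about resumed sessions: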

class CsvItemExporter(BaseItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):

        ....
        self._headers_not_written = True
        ....

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)
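The effect of that flag can be reproduced without Scrapy. The sketch below is only an illustration: MiniCsvExporter and run_session are hypothetical stand-ins that copy the header logic quoted above, and an in-memory buffer plays the role of out.csv opened for appending. It assumes that a fresh exporter is built on every run, which is what the reset of the flag implies.

```python
import csv
import io


class MiniCsvExporter:
    """Hypothetical stand-in that mimics only the header logic quoted above."""

    def __init__(self, file, include_headers_line=True):
        self.writer = csv.writer(file, lineterminator="\n")
        self.include_headers_line = include_headers_line
        # Set on every construction, so a resumed run starts from True again.
        self._headers_not_written = True

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            if self.include_headers_line:
                self.writer.writerow(item.keys())
        self.writer.writerow(item.values())


def run_session(file, items):
    """One crawl session: a new exporter is created, as on each restart."""
    exporter = MiniCsvExporter(file)
    for item in items:
        exporter.export_item(item)


output = io.StringIO()  # stands in for out.csv opened in append mode
run_session(output, [{"name": "Alice", "email": "alice@example.com"}])
run_session(output, [{"name": "Bob", "email": "bob@example.com"}])

lines = output.getvalue().splitlines()
print(lines)
```

The second session writes name,email again because nothing tells the new exporter that the file already has a header.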

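Also, the -o option does nothing more than tell the crawler to dump the scraped items into the specified output: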

class Command(ScrapyCommand):

    ....

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE", \
            help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE", \
            help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT", default="jsonlines", \
            help="format to use for dumping items with -o (default: %default)")
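One possible way around the repeated header is to make the writer look at the file before deciding whether a header is needed. This is a minimal sketch under stated assumptions, not Scrapy's own code: ResumeAwareCsvWriter is a hypothetical class built on the standard csv module, and an in-memory buffer stands in for the output file. In a real project the same check could probably live in a subclass of CsvItemExporter registered through the FEED_EXPORTERS setting, but that wiring is an assumption and is not shown here.

```python
import csv
import io
import os


class ResumeAwareCsvWriter:
    """Hypothetical writer that emits the header only into an empty file."""

    def __init__(self, file):
        self.writer = csv.writer(file, lineterminator="\n")
        # Move to the end so tell() reflects the existing content size.
        file.seek(0, os.SEEK_END)
        # A non-empty file is assumed to already contain the header.
        self._headers_not_written = file.tell() == 0

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self.writer.writerow(item.keys())
        self.writer.writerow(item.values())


output = io.StringIO()  # stands in for out.csv opened in append mode
sessions = [
    [{"name": "Alice", "email": "alice@example.com"}],  # first run
    [{"name": "Bob", "email": "bob@example.com"}],      # resumed run
]
for items in sessions:
    writer = ResumeAwareCsvWriter(output)  # new writer per run
    for item in items:
        writer.export_item(item)

lines = output.getvalue().splitlines()
print(lines)
```

This only deals with the duplicated header. The duplicated row in the question is a separate matter that this sketch does not address.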

Source: https://habr.com/ru/post/1530457/

