How to create a custom item exporter?

I am trying to create my own instance of the Scrapy Item Exporter based on the JsonLinesItemExporter, so I can slightly modify the structure that it creates.

I read the documentation at http://doc.scrapy.org/en/latest/topics/exporters.html, but it does not explain how to create a custom exporter, where to store it, or how to link it to a pipeline.

I have figured out how to plug in a custom exporter through Feed Exporters, but this does not meet my requirements, since I want to call the exporter from my pipeline.

Here is the code I came up with, saved in a file named exporters.py in the project root:

    from scrapy.contrib.exporter import JsonLinesItemExporter

    class FanItemExporter(JsonLinesItemExporter):

        def __init__(self, file, **kwargs):
            self._configure(kwargs, dont_fail=True)
            self.file = file
            self.encoder = ScrapyJSONEncoder(**kwargs)
            self.first_item = True

        def start_exporting(self):
            self.file.write("""{ 'product': [""")

        def finish_exporting(self):
            self.file.write("]}")

        def export_item(self, item):
            if self.first_item:
                self.first_item = False
            else:
                self.file.write(',\n')
            itemdict = dict(self._get_serialized_fields(item))
            self.file.write(self.encoder.encode(itemdict))

I tried calling it from my pipeline as FanItemExporter, with several variations of the import, but nothing worked.

1 answer

It is true that the Scrapy documentation does not say where to place an item exporter. To use an Item Exporter, follow these steps:

  1. Choose an Item Exporter class and import it into pipelines.py in the project directory. This can be a predefined item exporter (e.g. XmlItemExporter) or a custom one (e.g. the FanItemExporter defined in the question).
  2. Create an Item Pipeline class in pipelines.py and instantiate the imported item exporter in that class. Details are explained later in the answer.
  3. Register this pipeline class in the settings.py file.

The following is a detailed explanation of each step. Each step includes the part of the solution that applies to the question.

Step 1

  • If you are using the predefined Item Exporter class, import it from the scrapy.exporters module.
    Example: from scrapy.exporters import XmlItemExporter

  • If you need a custom exporter, define a custom class in a separate file. I suggest placing the class in exporters.py, in the project folder (alongside settings.py and items.py).

    When creating a new subclass, it is always possible to extend BaseItemExporter. That is appropriate when we intend to change the functionality completely. In this question, however, most of the required functionality already exists in the built-in JSON exporters.

Therefore, I am including two versions of the same item exporter. One extends the BaseItemExporter class and the other extends the JsonItemExporter class.

Version 1: BaseItemExporter

Since BaseItemExporter is the parent class, start_exporting(), finish_exporting() and export_item() must all be overridden to suit our needs.

    from scrapy.exporters import BaseItemExporter
    from scrapy.utils.serialize import ScrapyJSONEncoder
    from scrapy.utils.python import to_bytes

    class FanItemExporter(BaseItemExporter):

        def __init__(self, file, **kwargs):
            self._configure(kwargs, dont_fail=True)
            self.file = file
            self.encoder = ScrapyJSONEncoder(**kwargs)
            self.first_item = True

        def start_exporting(self):
            self.file.write(b'{\'product\': [')

        def finish_exporting(self):
            self.file.write(b'\n]}')

        def export_item(self, item):
            if self.first_item:
                self.first_item = False
            else:
                self.file.write(b',\n')
            itemdict = dict(self._get_serialized_fields(item))
            self.file.write(to_bytes(self.encoder.encode(itemdict)))
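To check what byte stream this framing produces without running a crawl, here is a minimal stdlib-only sketch. MiniExporter is a hypothetical stand-in, not a Scrapy class: it copies the three methods above but uses json.dumps in place of ScrapyJSONEncoder and plain dicts in place of items. It also takes the opening bytes as a parameter (an assumption added for illustration), which shows that the single-quoted {'product': [ prefix does not yield parseable JSON, whereas a double-quoted prefix does.

```python
import io
import json


class MiniExporter:
    """Hypothetical stand-in that mimics FanItemExporter's framing logic."""

    def __init__(self, file, prefix=b"{'product': ["):
        self.file = file
        self.prefix = prefix  # illustrative parameter, not in the original
        self.first_item = True

    def start_exporting(self):
        self.file.write(self.prefix)

    def finish_exporting(self):
        self.file.write(b"\n]}")

    def export_item(self, item):
        # Separate items with a comma, but not before the first one.
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b",\n")
        self.file.write(json.dumps(item).encode("utf-8"))


def run_export(items, **kwargs):
    """Drive a full start/export/finish cycle into an in-memory buffer."""
    buffer = io.BytesIO()
    exporter = MiniExporter(buffer, **kwargs)
    exporter.start_exporting()
    for item in items:
        exporter.export_item(item)
    exporter.finish_exporting()
    return buffer.getvalue()


items = [{"name": "fan A"}, {"name": "fan B"}]
single_quoted = run_export(items)
double_quoted = run_export(items, prefix=b'{"product": [')

print(single_quoted.decode("utf-8"))
print(json.loads(double_quoted))  # parses only with the double-quoted prefix
```

If the output file has to be read back as JSON, switching the prefix in start_exporting() to double quotes is the smallest change that makes it parseable.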

Version 2: JsonItemExporter

JsonItemExporter already provides the same implementation of the export_item() method (items separated by commas). Therefore, only start_exporting() and finish_exporting() are overridden.

The implementation of JsonItemExporter can be seen in python_dir\pkgs\scrapy-1.1.0-py35_0\Lib\site-packages\scrapy\exporters.py

    from scrapy.exporters import JsonItemExporter

    class FanItemExporter(JsonItemExporter):

        def __init__(self, file, **kwargs):
            # Initialize the object using the JsonItemExporter constructor,
            # forwarding any exporter options
            super().__init__(file, **kwargs)

        def start_exporting(self):
            self.file.write(b'{\'product\': [')

        def finish_exporting(self):
            self.file.write(b'\n]}')

Note: the standard Item Exporter classes expect binary files, so the output file must be opened in binary mode (the b flag, e.g. 'wb'). For the same reason, the write() calls in both versions write bytes to the file.
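The bytes requirement is easy to check with the standard library alone. The sketch below involves no Scrapy code, and the temporary file path is arbitrary: it opens a file in 'wb' mode, shows that writing a str is rejected with a TypeError, and that bytes are accepted.

```python
import os
import tempfile

# Arbitrary temporary path used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "items.json")

with open(path, "wb") as handle:  # binary-write mode
    try:
        handle.write("{'product': [")  # str: rejected by a binary file
        str_accepted = True
    except TypeError:
        str_accepted = False
    handle.write(b"{'product': [")  # bytes: accepted

with open(path, "rb") as handle:
    content = handle.read()

print(str_accepted, content)
```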

Step 2

Create an item pipeline class.

    from project_name.exporters import FanItemExporter

    class FanExportPipeline(object):

        def __init__(self, file_name):
            # Storing output filename
            self.file_name = file_name
            # Creating a file handle and setting it to None
            self.file_handle = None

        @classmethod
        def from_crawler(cls, crawler):
            # Getting the value of the FILE_NAME field from settings.py
            output_file_name = crawler.settings.get('FILE_NAME')

            # cls() calls the FanExportPipeline constructor
            # Returning a FanExportPipeline object
            return cls(output_file_name)

        def open_spider(self, spider):
            print('Custom export opened')

            # Opening file in binary-write mode
            file = open(self.file_name, 'wb')
            self.file_handle = file

            # Creating a FanItemExporter object and initiating export
            self.exporter = FanItemExporter(file)
            self.exporter.start_exporting()

        def close_spider(self, spider):
            print('Custom Exporter closed')

            # Ending the export to file from the FanItemExporter object
            self.exporter.finish_exporting()

            # Closing the opened output file
            self.file_handle.close()

        def process_item(self, item, spider):
            # Passing the item to the FanItemExporter object for exporting to file
            self.exporter.export_item(item)
            return item
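The order in which these hooks are invoked can be simulated without a crawl. The sketch below is not Scrapy code: LineExporter, DemoPipeline and the SimpleNamespace fake crawler are hypothetical stand-ins, and the only assumption about the real crawler is that crawler.settings.get('FILE_NAME') returns the configured value.

```python
import os
import tempfile
from types import SimpleNamespace


class LineExporter:
    """Hypothetical stub exporter: writes each item's repr on its own line."""

    def __init__(self, file):
        self.file = file

    def start_exporting(self):
        self.file.write(b"BEGIN\n")

    def export_item(self, item):
        self.file.write(repr(item).encode("utf-8") + b"\n")

    def finish_exporting(self):
        self.file.write(b"END\n")


class DemoPipeline:
    """Same hook layout as FanExportPipeline, using the stub exporter."""

    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("FILE_NAME"))

    def open_spider(self, spider):
        self.file_handle = open(self.file_name, "wb")
        self.exporter = LineExporter(self.file_handle)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file_handle.close()


# Fake crawler: a plain dict stands in for Scrapy's settings object.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
crawler = SimpleNamespace(settings={"FILE_NAME": out_path})

# Call the hooks in the order a crawl would: build, open, process, close.
pipeline = DemoPipeline.from_crawler(crawler)
pipeline.open_spider(spider=None)
for item in ({"id": 1}, {"id": 2}):
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, "rb") as handle:
    written = handle.read()
print(written.decode("utf-8"))
```

Because finish_exporting() runs only in close_spider(), the closing bytes are written once, after the last item.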

Step 3

Now that the item export pipeline is defined, register it in the settings.py file. Also add a FILE_NAME setting to settings.py; it holds the path of the output file.

Add the following lines to the settings.py file.

    FILE_NAME = 'path/outputfile.ext'
    ITEM_PIPELINES = {
        'project_name.pipelines.FanExportPipeline': 600,
    }

If ITEM_PIPELINES is already uncommented in settings.py, add only the following line to the existing ITEM_PIPELINES dictionary instead.

'project_name.pipelines.FanExportPipeline': 600,
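The number assigned to each pipeline sets its position in the chain: Scrapy runs pipelines from lower to higher values, conventionally within the 0-1000 range. A small sketch of that ordering, where project_name.pipelines.CleanupPipeline is a hypothetical second pipeline added only for illustration:

```python
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline': 600,
    'project_name.pipelines.CleanupPipeline': 300,  # hypothetical pipeline
}

# Sort by priority value to see the order in which items flow through.
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])
]
print(run_order)
```

With these values the export pipeline runs after the hypothetical cleanup step, which is usually what you want for a pipeline that writes items out.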

This is one way to create a custom item export pipeline.



Source: https://habr.com/ru/post/1234321/

