It is true that the Scrapy documentation does not say where an item exporter should be placed. To use an Item Exporter, follow these steps:
- Select the Item Exporter class and import it into pipelines.py in the project directory. This can be a predefined item exporter (e.g. XmlItemExporter) or a custom one (e.g. the FanItemExporter defined in the question).
- Create an Item Pipeline class in pipelines.py and instantiate the imported item exporter in this class. Details are explained later in the answer.
- Register this pipeline class in the settings.py file.
The following is a detailed explanation of each step. A solution to the issue is included in every step.
Step 1
If you are using the predefined Item Exporter class, import it from the scrapy.exporters module.
Example: from scrapy.exporters import XmlItemExporter
If you need a custom exporter, define a custom class in a separate file. I suggest placing the class in an exporters.py file in the project folder (the one that contains settings.py and items.py).
When creating a new subclass, it is always useful to import BaseItemExporter and extend it. That is appropriate if we intend to change the functionality completely. In this case, however, most of the required functionality is already provided by JsonItemExporter.
Therefore, I am including two versions of the same item exporter: one extends the BaseItemExporter class and the other extends the JsonItemExporter class.
Version 1: BaseItemExporter
Since BaseItemExporter is the parent class, start_exporting(), finish_exporting() and export_item() must be overridden to suit our needs.
    from scrapy.exporters import BaseItemExporter
    from scrapy.utils.serialize import ScrapyJSONEncoder
    from scrapy.utils.python import to_bytes


    class FanItemExporter(BaseItemExporter):

        def __init__(self, file, **kwargs):
            self._configure(kwargs, dont_fail=True)
            self.file = file
            self.encoder = ScrapyJSONEncoder(**kwargs)
            self.first_item = True

        def start_exporting(self):
            # Opening wrapper, written once before the first item
            self.file.write(b'{\'product\': [')

        def finish_exporting(self):
            # Closing wrapper, written once after the last item
            self.file.write(b'\n]}')

        def export_item(self, item):
            # Separate items with a comma, except before the first one
            if self.first_item:
                self.first_item = False
            else:
                self.file.write(b',\n')
            itemdict = dict(self._get_serialized_fields(item))
            self.file.write(to_bytes(self.encoder.encode(itemdict)))
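To make the resulting file layout concrete, here is a minimal standard-library sketch that repeats the write sequence of Version 1 without Scrapy. The function name write_fan_export and the sample items are made up for illustration, and json.dumps is only a stand-in for ScrapyJSONEncoder plus to_bytes.

```python
import io
import json


def write_fan_export(file, items):
    """Mimic the write sequence of FanItemExporter (hypothetical helper, no Scrapy)."""
    file.write(b"{'product': [")      # same bytes as start_exporting()
    first_item = True
    for item in items:
        if first_item:
            first_item = False
        else:
            file.write(b",\n")        # separator before every item except the first
        # Stand-in for ScrapyJSONEncoder.encode() followed by to_bytes()
        file.write(json.dumps(item).encode("utf-8"))
    file.write(b"\n]}")               # same bytes as finish_exporting()


buffer = io.BytesIO()                 # binary buffer, like a file opened with 'wb'
write_fan_export(buffer, [{"name": "A"}, {"name": "B"}])
output = buffer.getvalue().decode("utf-8")
print(output)

# The wrapper key is written with single quotes, so a strict JSON parser
# rejects the result even though every item is valid JSON on its own.
try:
    json.loads(output)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False
```

The output follows the layout requested in the question. If strictly valid JSON is needed, one option is to write the opening wrapper with double quotes around the key.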
Version 2: JsonItemExporter
JsonItemExporter already provides exactly the same implementation of the export_item() method. Therefore, only start_exporting() and finish_exporting() are overridden.
The implementation of JsonItemExporter can be found in python_dir\pkgs\scrapy-1.1.0-py35_0\Lib\site-packages\scrapy\exporters.py
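    from scrapy.exporters import JsonItemExporter


    class FanItemExporter(JsonItemExporter):

        def __init__(self, file, **kwargs):
            # Initialise the object through the parent class constructor
            super().__init__(file, **kwargs)

        # start_exporting() and finish_exporting() are overridden here
        # with the same bodies as in Version 1; export_item() is inherited.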
Note: When writing data to a file, keep in mind that the standard Item Exporter classes expect binary files. Therefore, the file must be opened in binary mode (b). For the same reason, the write() calls in both versions write bytes to the file.
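The effect of binary mode can be checked with plain Python, independent of Scrapy. In this small sketch the temporary output path is an assumption chosen for illustration; it shows that a file opened with 'wb' accepts bytes and rejects str.

```python
import os
import tempfile

# Hypothetical output path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "items.out")

with open(path, "wb") as f:               # binary mode, as the exporters expect
    f.write(b"{'product': [")             # bytes are accepted
    try:
        f.write("\n]}")                   # str is rejected in binary mode
        str_write_failed = False
    except TypeError:
        str_write_failed = True
    f.write("\n]}".encode("utf-8"))       # encode to bytes first, then write

with open(path, "rb") as f:
    content = f.read()
```

This is why both versions of the exporter pass bytes literals (b'...') or the result of to_bytes() to write().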
Step 2
Create an Item Pipeline class.
    from project_name.exporters import FanItemExporter


    class FanExportPipeline(object):

        def __init__(self, file_name):
            # Store the output file name (the FILE_NAME setting from Step 3)
            self.file_name = file_name
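The snippet above shows only the start of the pipeline class. As a hedged sketch of how the rest might look, the following uses the standard Scrapy pipeline hooks from_crawler(), open_spider(), process_item() and close_spider(). Everything else is an assumption made so the example runs without Scrapy: StandInExporter is a hypothetical replacement for FanItemExporter, the exporter_class attribute is introduced only for illustration, and a SimpleNamespace with a plain dict plays the role of the crawler and its settings.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class StandInExporter:
    """Hypothetical stand-in for FanItemExporter so the sketch runs without Scrapy."""

    def __init__(self, file):
        self.file = file
        self.first_item = True

    def start_exporting(self):
        self.file.write(b"{'product': [")

    def finish_exporting(self):
        self.file.write(b"\n]}")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b",\n")
        self.file.write(json.dumps(dict(item)).encode("utf-8"))


class FanExportPipeline:
    exporter_class = StandInExporter  # in a real project: FanItemExporter

    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None
        self.exporter = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, when present, to build the pipeline;
        # here it reads the FILE_NAME setting added in Step 3.
        return cls(crawler.settings.get("FILE_NAME"))

    def open_spider(self, spider):
        # Binary mode, because the exporter writes bytes
        self.file_handle = open(self.file_name, "wb")
        self.exporter = self.exporter_class(self.file_handle)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file_handle.close()


# Dry run without a crawl: a plain dict stands in for crawler.settings.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
fake_crawler = SimpleNamespace(settings={"FILE_NAME": out_path})
pipeline = FanExportPipeline.from_crawler(fake_crawler)
pipeline.open_spider(spider=None)
for sample in ({"name": "A"}, {"name": "B"}):
    pipeline.process_item(sample, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, "rb") as f:
    written = f.read()
```

Opening the file in open_spider() and closing it in close_spider() keeps one file handle per crawl, so start_exporting() and finish_exporting() each run exactly once.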
Step 3
Now that the item export pipeline is defined, register it in the settings.py file. Also add a FILE_NAME setting to settings.py; it holds the path of the output file.
Add the following lines to the settings.py file.
    FILE_NAME = 'path/outputfile.ext'
    ITEM_PIPELINES = {
        'project_name.pipelines.FanExportPipeline': 600,
    }
If ITEM_PIPELINES is already uncommented in settings.py, add the following line to the existing ITEM_PIPELINES dictionary instead.
'project_name.pipelines.FanExportPipeline': 600,
This is one way to create a custom item export pipeline.
NOTE: