I found many tutorials on Scrapy (like this good tutorial), and all of the steps below come from them. The result is a project with a large number of files (scrapy.cfg + some .py files + a specific folder structure).
How can I make the steps (listed below) work as a standalone Python file that can be run with python mycrawler.py? (See the sketch after step 4 for roughly what I have in mind.)
(That is, instead of a complete project with a large number of files, some .cfg files, etc., and instead of running scrapy crawl myproject -o myproject.json... By the way, it seems that scrapy is a new shell command, is that true?)
Note: there may already be an answer to this question, but unfortunately it is outdated and no longer works.
1) Create a new project with scrapy startproject myproject
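(Just to show what I mean by "a large number of files": as far as I remember, this step alone already generates roughly the following layout, which is exactly what I would like to avoid — file names may differ slightly:)

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py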
2) Define the data structure using Item as follows:
from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()
    link = Field()
    ...
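(If I understand the Item class correctly, it then gets filled in like a dict — my own example values, not from the tutorial:)

item = MyItem()
item["title"] = "Some page title"
item["link"] = "http://www.example.com/some-page"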
3) Define the crawler as follows:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...
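(For completeness, I imagine the elided part of parse looks roughly like this — my own XPaths, just for illustration:)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for link in hxs.select("//a"):
            item = MyItem()
            item["title"] = link.select("text()").extract()
            item["link"] = link.select("@href").extract()
            yield item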
4) Run with:
scrapy crawl myproject -o myproject.json
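And here is the kind of single file I would like to end up with instead. This is only a guess from skimming the docs; in particular I am not sure that CrawlerProcess is the right thing to use, or that these settings names are what replaces the -o option:

# mycrawler.py -- rough sketch of what I'm hoping for, probably wrong
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.crawler import CrawlerProcess  # guessing at this import

class MyItem(Item):
    title = Field()
    link = Field()

class MySpider(BaseSpider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # extract and yield MyItem instances here

if __name__ == "__main__":
    # would replace settings.py and the "-o myproject.json" option
    process = CrawlerProcess({"FEED_FORMAT": "json", "FEED_URI": "myproject.json"})
    process.crawl(MySpider)
    process.start()  # blocks until the crawl is finished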