Web crawler in standalone python file

Question

Web crawler in standalone python file

I found many tutorials on Scrapy(like this good tutorial ), for which all of the steps below are for. The result is a project with a large number of files ( project.cfg+ some .pyfiles + a specific folder structure).

How to make the steps (listed below) work as a standalone python file that can be run using python mycrawler.py ?

(instead of a complete project with a large number of files, some .cfg files, etc., and by the way, use scrapy crawl myproject -o myproject.json..., it seems that Scrapyis a new shell command? is it true?)

Note: there may be an answer to this question , but unfortunately it is outdated and no longer works.

1) Create a new project with scrapy startproject myproject

2) Define the data structure using Itemas follows:

from scrapy.item import Item, Field
    class MyItem(Item):
        title = Field() 
        link = Field()
        ...

3) Identify the crawler using

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myproject"
    allowed_domains = ["example.com"] 
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...

4) Run with:

scrapy crawl myproject -o myproject.json

+4

python web-crawler web-scraping scrapy

Basj Jan 2 '15 at 12:51

source share

2 answers

Scrapy - unix, , python, javac, gcc ..
bcz u , , . , , bash script , - .

, urllib3,

+1

aibotnet 02 . '15 16:03

pad · Accepted Answer · 2015-01-02T17:58:08+0000

You can run scrapy spiders as a single script without starting a project using runspider Is this what you wanted?

#myscript.py
from scrapy.item import Item, Field
from scrapy import Spider

class MyItem(Item):
    title = Field() 
    link = Field()

class MySpider(Spider):

     start_urls = ['http://www.example.com']
     name = 'samplespider'

     def parse(self, response):
          item = MyItem()
          item['title'] = response.xpath('//h1/text()').extract()
          item['link'] = response.url
          yield item

Now you can run this with scrapy runspider myscript.py -o out.json

Web crawler in standalone python file

More articles: