How to use Scrapy

I would like to know how to run a Scrapy-based spider. I installed the tool via apt-get, and I tried to run the example:

/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy list
directory.google.com

/usr/share/doc/scrapy/examples/googledir/googledir$ scrapy crawl

I modified the code in spiders/google_directory.py, but it does not seem to run, because I do not see any of the print statements I inserted. I read the documentation, but did not find anything related to this; do you have any ideas?

Also, if you think I should use another tool to crawl websites, please let me know. I am not familiar with Python crawling tools, but Python is a must.

Thanks!

2 answers

You missed the spider name in the crawl command. Use:

$ scrapy crawl directory.google.com

In addition, I suggest you copy the example project into your home directory instead of working inside /usr/share/doc/scrapy/examples/, so you can modify it and play with it:

$ cp -r /usr/share/doc/scrapy/examples/googledir ~
$ cd ~/googledir
$ scrapy crawl directory.google.com

EveryBlock.com has released some quality scraping code, using lxml, urllib2 and Django as its stack.

Scraperwiki.com is an inspiring source of complete Python scraper examples.

A simple example using lxml's cssselect:

from lxml.html import fromstring

dom = fromstring('<html... ...')  # full HTML document string goes here
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]
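If installing lxml is not an option, simple link extraction can also be done with nothing but the standard library's `html.parser`. This is a minimal sketch, not a substitute for a real HTML library (for instance, it does not account for void elements like `<br>` inside the container), and the `#navigation` markup is a made-up example:

```python
from html.parser import HTMLParser


class NavLinkParser(HTMLParser):
    """Collect href attributes of <a> tags inside id="navigation"."""

    def __init__(self):
        super().__init__()
        self.in_nav = False   # are we inside the navigation element?
        self.depth = 0        # nesting depth below the navigation element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "navigation":
            self.in_nav = True
            self.depth = 0
        elif self.in_nav:
            self.depth += 1
            if tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.in_nav:
            if self.depth == 0:
                self.in_nav = False  # left the navigation element
            else:
                self.depth -= 1


parser = NavLinkParser()
parser.feed('<div id="navigation"><a href="/home">Home</a>'
            '<a href="/about">About</a></div>')
print(parser.links)  # ['/home', '/about']
```

For anything beyond trivial markup, lxml (or Scrapy's own selectors) is the more robust choice; the point here is only that pure Python can do it.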

Source: https://habr.com/ru/post/1766077/
