I am writing a crawler in Python that crawls all pages within a given domain, as part of a domain-specific search engine. I use Django, Scrapy and Celery to achieve this. The scenario is as follows:
I get the domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
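The relevant part of the view is roughly the following sketch (the view name, form field and template are placeholders, not my actual code):

from django.shortcuts import render

from .tasks import crawl


def submit_domain(request):
    # hypothetical view: read the domain from a submitted form
    # and hand the crawl off to Celery
    if request.method == 'POST':
        crawl.delay(request.POST['domain'])
    return render(request, 'search/submit.html')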
The task itself simply calls a function that starts the crawl process:
from .crawler.crawl import run_spider
from celery import shared_task


@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawl process as in this SO answer, with MySpider replaced by WebSpider.
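I won't paste the linked answer here, but run_spider follows that script-based pattern, roughly along these lines (the WebSpider import path is a placeholder):

# crawler/crawl.py (sketch)
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

from .spiders.web_spider import WebSpider  # placeholder module path


def run_spider(domain):
    spider = WebSpider(domain=domain)
    crawler = Crawler(get_project_settings())
    # stop the Twisted reactor once the spider finishes
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks until the crawl is done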
WebSpider inherits from CrawlSpider; for now I only use it to test functionality. The single rule that is set takes an SgmlLinkExtractor instance and a callback function parse_page, which simply retrieves the response URL and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it to the database (not very efficiently, I know).
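HTMLPageItem is just a DjangoItem wrapping a Django model, roughly like this (the app and model names here are placeholders):

# items.py (sketch)
from scrapy.contrib.djangoitem import DjangoItem

from pages.models import HTMLPage  # placeholder app/model names


class HTMLPageItem(DjangoItem):
    # the url and title fields come from the HTMLPage model
    django_model = HTMLPage

The spider itself: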
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider


class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        # follow every in-domain link and hand each page to parse_page
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        # store the URL and <title> of every crawled page
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
Problem: the crawler only crawls the start_urls and doesn't follow the extracted links (or call the callback function) when it is run through Celery. However, calling run_spider from python manage.py shell works just fine!
Another problem is that the item pipelines and logging do not work with Celery. This makes debugging much harder. I think these problems may be related.