Scrapy spider does not follow links when using Celery

I am writing a crawler in Python that crawls all pages of a given domain, as part of a domain-specific search engine. I am using Django, Scrapy and Celery to achieve this. The scenario is as follows:

I get the domain name from the user and call the crawl task inside the view, passing the domain as an argument:

crawl.delay(domain)
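For context, the view itself does nothing more than read the domain and queue the task. A minimal sketch (the view name and form field below are placeholders, not my real code):

# views.py - minimal sketch; start_crawl and the 'domain' field are illustrative
from django.http import HttpResponse
from .tasks import crawl

def start_crawl(request):
    domain = request.POST['domain']  # domain name entered by the user
    crawl.delay(domain)              # queue the crawl as a Celery task
    return HttpResponse("Crawl queued for %s" % domain)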

The task itself simply calls a function that starts the crawl process:

from .crawler.crawl import run_spider
from celery import shared_task

@shared_task
def crawl(domain):
    return run_spider(domain) 

run_spider starts the crawl process as described in this SO answer, replacing MySpider with WebSpider.
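For reference, run_spider follows that answer's pattern roughly like this (a simplified sketch against the old pre-1.0 Scrapy API, not a verbatim copy of my file; the original answer also wraps part of the __init__ in an if guard, which the update in the answer below comes back to):

# crawler/crawl.py - rough sketch of the pattern from the linked answer
from multiprocessing import Process

from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

from .web_spider import WebSpider

class WebCrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        self.crawler = Crawler(get_project_settings())
        self.crawler.configure()
        # stop the reactor once the spider closes so the child process can exit
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        reactor.run()

def run_spider(domain):
    spider = WebSpider(domain=domain)
    crawler = WebCrawlerScript(spider)
    crawler.start()  # run in a separate process so the Twisted reactor can block there
    crawler.join()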

WebSpider inherits from CrawlSpider, and for now I use it only to test functionality. The only rule set takes an SgmlLinkExtractor instance and a callback function parse_page that simply retrieves the response URL and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it to the database (not very efficient, I know).

from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider

class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item

Problem: the crawler only crawls the start_urls and does not follow links (or call the callback function) when running this script through Celery. However, calling run_spider through python manage.py shell works just fine!
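For example, running the same thing from python manage.py shell crawls the whole site and follows links (the import path below is illustrative):

from myapp.crawler.crawl import run_spider  # illustrative import path
run_spider("example.com")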

Another problem is that the item pipelines and logging do not work with Celery, which makes debugging a lot harder. I think these problems may be related.


Answer: I tracked the problems down after digging into Scrapy's code and enabling Celery logging by adding these two lines to web_spider.py:

from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

Problem one is in the initialization function of WebSpider, which started with:

super(WebSpider, self).__init__(**kw)

CrawlSpider's __init__ calls _compile_rules, which copies the rules from self.rules into self._rules while making a few changes. self._rules is what the spider actually consults when it looks for links to follow. Calling CrawlSpider's __init__ before the rules are defined leaves self._rules empty, so no links are followed.

Moving the super(WebSpider, self).__init__(**kw) call to the last line of WebSpider's __init__ fixed the problem.
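In other words, WebSpider's __init__ should build the rules first and call the parent's __init__ last:

def __init__(self, **kw):
    url = kw.get('domain') or kw.get('url')
    if not (url.startswith('http://') or url.startswith('https://')):
        url = "http://%s/" % url
    self.url = url
    self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
    self.start_urls = [url]
    self.rules = [
        Rule(SgmlLinkExtractor(
            allow_domains=self.allowed_domains,
            unique=True), callback='parse_page', follow=True)
    ]
    # call CrawlSpider.__init__ last, so _compile_rules() sees the rules above
    super(WebSpider, self).__init__(**kw)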

Update: the code from the SO answer referenced above contains a small mistake. The fix is simple: in WebCrawlerScript's __init__ method, move this line:

self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

out of the if statement.

Problem two: the pipelines now work! It was not a Celery problem at all. The Scrapy settings module was simply not being read; it was an import problem. To fix it:

Set the SCRAPY_SETTINGS_MODULE environment variable in the django project's settings module, myproject/settings.py:

import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'

In the Scrapy settings module crawler/settings.py, add the Scrapy project path to sys.path so that relative imports inside the settings work:

import sys
sys.path.append('/absolute/path/to/scrapy/project')

Everything now works as expected.

