Scrapy CrawlSpider + Splash: how to follow links via linkextractor?

I have the following code, which partially works:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(LinkExtractor(
            allow=(),
            restrict_xpaths=("//a[contains(text(), 'Next Page')]")
        ),
            callback='parse_item',
            process_request='start_requests',
            follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        # item parser
        pass

The code only works for the start_urls; it does not follow the links matched by restrict_xpaths. If I comment out the start_requests() method and the line process_request='start_requests' in the rules, the spider runs and follows the links, though of course without JS rendering.

I read two related questions, CrawlSpider with Splash stuck after the first URL and CrawlSpider with Splash, and specifically changed scrapy.Request() to SplashRequest() in the start_requests() method, but that does not seem to work. What is wrong with my code? Thanks.

I had a seemingly similar issue using Splash with a Scrapy CrawlSpider: it would crawl the start URL and then stop. The workaround that worked for me was to use the scrapy-splash "process_links" hook to point every extracted URL at the Splash HTTP API instead. I am not sure it is the cleanest solution, but it works. Here is the gist:

First, the imports needed to build the Splash URLs and to unpack them again later:

from urllib.parse import urlencode, parse_qs

Since every rewritten URL now points at the Splash endpoint and Scrapy filters out "offsite" requests, make "localhost" the allowed domain:

allowed_domains = ['localhost']
start_urls = ['https://www.example.com/']

Next, the rule. Because "localhost" is now the allowed domain, the LinkExtractor has to be told explicitly, via a regex, which site's links to extract; the rule also wires in the process_links callback:

LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
process_links='process_links',
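
For context, here is a hedged sketch of how this fragment might sit in the spider's rules attribute (the callback name is illustrative, not from the original answer):

rules = (
    Rule(
        LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
        process_links='process_links',
        callback='parse_item',
        follow=True,
    ),
)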

Then the process_links method itself; urlencode takes care of escaping the original URL and the arguments passed to splash:

def process_links(self, links):
    for link in links:
        # rewrite each extracted link to go through the Splash HTTP API,
        # skipping links that have already been rewritten
        if "http://localhost:8050/render.html?&" not in link.url:
            link.url = "http://localhost:8050/render.html?&" + urlencode({'url': link.url,
                                                                          'wait': 2.0})
    return links
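
With this in place, a link to https://www.example.com/page comes out as something like http://localhost:8050/render.html?&url=https%3A%2F%2Fwww.example.com%2Fpage&wait=2.0 (assuming Splash is listening on its default port, localhost:8050).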

Then, to recover the original page URL from the Splash URL in your callbacks, use parse_qs:

parse_qs(response.url)['url'][0] 
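
For illustration, a minimal callback using this might look as follows (the item field is made up for the example):

def parse_item(self, response):
    # response.url points at the Splash endpoint; the original page URL
    # travels in the "url" query parameter added by process_links above.
    # parse_qs happily ignores the non-query prefix of the full URL here.
    original_url = parse_qs(response.url)['url'][0]
    yield {'page_url': original_url}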

One final note about this approach: notice the '&' right after the '?' in the Splash URL (...render.html?&). It keeps the start of every Splash URL identical regardless of the argument order produced by urlencode, so the already-rewritten check above keeps matching.

Try using

restrict_xpaths=('//a[contains(text(), "Next Page")]')

instead of

restrict_xpaths=("//a[contains(text(), 'Next Page')]")

As referenced in https://github.com/scrapy-plugins/scrapy-splash/issues/92:

Personally, I use dont_process_response=True, so the response is an HtmlResponse (which is required by the code in _requests_to_follow).

I also redefine the _build_request method in my spider, like so:

def _build_request(self, rule, link):
    # Use SplashRequest so the followed page is rendered, but keep the
    # plain HtmlResponse (dont_process_response=True) so CrawlSpider's
    # _requests_to_follow still processes it.
    r = SplashRequest(url=link.url, callback=self._response_downloaded,
                      args={'wait': 0.5}, dont_process_response=True)
    r.meta.update(rule=rule, link_text=link.text)
    return r

In the GitHub issue, some users simply override the _requests_to_follow method in their class.
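
As a sketch of that approach, the following mirrors CrawlSpider._requests_to_follow from Scrapy 1.x with the response-type check widened so Splash responses are not silently dropped (newer Scrapy versions structure this method differently, so treat it as an assumption, not the library's exact code):

from scrapy.http import HtmlResponse
from scrapy_splash import SplashJsonResponse, SplashTextResponse

def _requests_to_follow(self, response):
    # Same logic as the stock CrawlSpider method, except that Splash
    # responses pass the isinstance check instead of being skipped.
    if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r)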
