How to access a command line parameter in a CrawlSpider in Scrapy?

I want to pass a parameter on the scrapy crawl ... command line and use it when defining the rules of a CrawlSpider, for example after

name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com']

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php')
    # and follow links from them (since no callback means follow=True by default).
    Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider method parse_item
    Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
)

I want the allow attribute of SgmlLinkExtractor to come from a command line parameter. I googled and found that I can read the parameter value in the spider's __init__ method, but how do I get the command line parameter into the rule definition?

1 answer

You can build your spider's rules attribute in the __init__ method, for example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):

    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # Build the rules from the command line argument before calling
        # the parent __init__, which compiles them.
        self.rules = (
            Rule(SgmlLinkExtractor(allow=(allow,))),
        )
        super(MySpider, self).__init__(*args, **kwargs)
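
Note that self.rules has to be assigned before calling the parent __init__, because CrawlSpider.__init__ compiles the rules; if you set it afterwards you would have to call self._compile_rules() yourself.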

Then pass the allow pattern on the command line:

scrapy crawl example.com -a allow="item\.php"
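
If you are on a newer Scrapy release, SgmlLinkExtractor has been removed, but the same approach should work with LinkExtractor. A minimal sketch, assuming the current scrapy.spiders and scrapy.linkextractors module paths:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):

    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def __init__(self, allow=None, *args, **kwargs):
        # The -a allow=... value arrives here as a keyword argument;
        # build the rules from it before the parent __init__ compiles them.
        self.rules = (
            Rule(LinkExtractor(allow=(allow,))),
        )
        super().__init__(*args, **kwargs)

It is invoked the same way: scrapy crawl example.com -a allow="item\.php"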

Source: https://habr.com/ru/post/1538514/

