How to access the specific start_url in a Scrapy CrawlSpider?

I use Scrapy, in particular the Scrapy CrawlSpider class, to scrape web links containing specific keywords. I have a rather long start_urls list that gets its entries from a SQLite database, which is associated with a Django project. I want to save the scraped web links in this database.

I have two Django models, one for start URLs like http://example.com and one for scraped web links like http://example.com/website1 , http://example.com/website2 etc. All scraped web links are children of one of the start URLs in the start_urls list.

The web link model has a many-to-one relationship to the start URL model, that is, the web link model has a ForeignKey to the start URL model. In order to save my scraped web links properly in the database, I need to tell the CrawlSpider's parse_item() method which start URL the scraped web link belongs to. How can I do this? Scrapy's DjangoItem class does not help in this regard, since I would still have to define the used start URL explicitly.
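
For illustration, a minimal sketch of the two models; the class names here are hypothetical, only the ForeignKey (many-to-one) relationship matters:

    from django.db import models

    class StartUrl(models.Model):
        url = models.URLField()

    class ScrapedLink(models.Model):
        url = models.URLField()
        # Each scraped link belongs to exactly one start URL.
        start_url = models.ForeignKey(StartUrl, on_delete=models.CASCADE)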

In other words, how do I pass the currently used start URL to the parse_item() method so that I can save it together with the corresponding scraped web links in the database? Any ideas? Thanks in advance!

+6
3 answers

By default, you cannot access the original start URL.

But you can override the make_requests_from_url method and put the start URL in meta. Then, in parse, you can extract it from there (if you yield subsequent requests in that parse method, don't forget to forward the start URL in their meta as well).
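
A minimal sketch of that idea (untested):

    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.http import Request

    class MySpider(CrawlSpider):
        def make_requests_from_url(self, url):
            # Tag the initial request with its own URL as the start URL;
            # callbacks can then read it from response.meta['start_url'].
            return Request(url, dont_filter=True, meta={'start_url': url})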


I have not worked with CrawlSpider, and maybe what Maxim suggests will work for you, but keep in mind that response.url contains the URL after possible redirects.

Here is an example of how I would do it, but it is just an example (adapted from the CrawlSpider example in the Scrapy docs) and has not been tested:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.item import Item, Field

    class MyItem(Item):
        # Stands in for your item class (e.g. a DjangoItem with these fields).
        id = Field()
        name = Field()
        description = Field()
        start_url = Field()

    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Extract links matching 'category.php' (but not 'subsection.php')
            # and follow links from them (no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),
            # Extract links matching 'item.php' and parse them with parse_item.
            Rule(SgmlLinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
        )

        def parse(self, response):
            # Normally, avoid using parse as a callback when writing crawl
            # spider rules: CrawlSpider uses the parse method itself to
            # implement its logic, so overriding it carelessly breaks the
            # spider. Here we override it only to forward start_url, and
            # delegate everything else to CrawlSpider.parse.
            for request_or_item in CrawlSpider.parse(self, response):
                if isinstance(request_or_item, Request):
                    request_or_item = request_or_item.replace(
                        meta={'start_url': response.meta['start_url']})
                yield request_or_item

        def make_requests_from_url(self, url):
            """Receives a URL and returns a Request object (or a list of
            Request objects) to scrape. Used to construct the initial requests
            in the start_requests() method, and typically used to convert
            URLs to requests.
            """
            return Request(url, dont_filter=True, meta={'start_url': url})

        def parse_item(self, response):
            self.log('Hi, this is an item page! %s' % response.url)
            hxs = HtmlXPathSelector(response)
            item = MyItem()
            item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
            item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
            item['start_url'] = response.meta['start_url']
            return item

Ask if you have any questions. BTW, using PyDev's 'Go to definition' function, you can look at the Scrapy sources and see what Request , make_requests_from_url and other classes and methods expect. Digging into the code helps and saves you time, even though it may seem difficult at first.

+8

If I understand the problem correctly, you can get the URL from response.url and then write it to item['url'].

In the spider: item['url'] = response.url

And in the pipeline: url = item['url'].

Or put response.url in meta, as warvariuc wrote.
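
A minimal sketch of that split; the pipeline class name is illustrative, not from the original answer:

    class SaveUrlPipeline(object):
        def process_item(self, item, spider):
            url = item['url']  # the final URL, after any redirects
            # ... look up the matching start URL record and save here ...
            return item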

+1

It looks like warvariuc's answer needs a small modification for Scrapy 1.3.3: you need to override _parse_response instead of parse. Overriding make_requests_from_url is no longer required.
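
A rough sketch of what that override might look like (untested; _parse_response is a private CrawlSpider method, so verify its signature against your Scrapy version):

    from scrapy.http import Request
    from scrapy.spiders import CrawlSpider

    class MySpider(CrawlSpider):
        def _parse_response(self, response, callback, cb_kwargs, follow=True):
            # The very first response has no 'start_url' in meta yet, so fall
            # back to response.url (the start URL itself); this is why
            # overriding make_requests_from_url is no longer needed.
            start_url = response.meta.get('start_url', response.url)
            for request_or_item in super(MySpider, self)._parse_response(
                    response, callback, cb_kwargs, follow):
                if isinstance(request_or_item, Request):
                    request_or_item.meta['start_url'] = start_url
                yield request_or_item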

+1

Source: https://habr.com/ru/post/915777/

