By default, you cannot access the original start URL. But you can override the make_requests_from_url method and put the start URL into the request's meta. Then, in the parse method, you can extract it from there (if you yield subsequent requests from that parse method, be sure to forward the start URL to them).
I have not worked with CrawlSpider, and maybe what Maxim suggests will work for you, but keep in mind that response.url holds the URL after any redirects.
Here is an example of how I would do this, but it is just an example (taken from the Scrapy documentation) and has not been tested:
    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
        rules = (
Feel free to ask if you have any questions. BTW, using PyDev's 'Go to definition' feature, you can look at the Scrapy sources and see what parameters Request, make_requests_from_url, and other classes and methods expect. Getting into the code helps and saves you time, although it may seem difficult at first.