Use a link extractor.
When creating an instance, you must pass the allowed domain through the allow_domains parameter. You do not need to specify the tags to scan, since (according to the docs) the tags parameter defaults to ('a', 'area').
Using the Rust lang website as an example, the code for printing all internal links from your domain will look like this:

    import scrapy
    from scrapy.linkextractors import LinkExtractor


    class RustSpider(scrapy.Spider):
        name = "rust"
        allowed_domains = ["www.rust-lang.org"]
        start_urls = (
            'http://www.rust-lang.org/',
        )

        def parse(self, response):
            # Keep only links that point to rust-lang.org or its subdomains
            extractor = LinkExtractor(allow_domains='rust-lang.org')
            links = extractor.extract_links(response)
            for link in links:
                print(link.url)

The output will be a list of links such as https://doc.rust-lang.org/nightly/reference.html (I cannot post more); links to other domains, such as StackOverflow, are excluded.
Please do not forget to check the documentation page, since the link extractor has many parameters that you may need.