Scrapy: store all external links and crawl all internal links

I am working on a Scrapy spider that crawls all internal links from the source URL and collects only the external links. My main problem is classifying links as external or internal. For example, when I try to filter out external links with link.startswith("http") or link.startswith("ftp") or link.startswith("www"), a site that links to its own pages with an absolute URL (www.my-domain.com/about instead of /about) gets those links classified as external, even though they are not. Below is my code:

    import scrapy
    from lab_relationship.items import Links

    class WebSpider(scrapy.Spider):
        name = "web"
        allowed_domains = ["my-domain.com"]
        start_urls = (
            'www.my-domain.com',
        )

        def parse(self, response):
            """Finds all external links."""
            items = []
            for link in set(response.xpath('//a/@href').extract()):
                item = Links()
                if len(link) > 1:
                    if link.startswith("/") or link.startswith("."):
                        # internal link
                        url = response.urljoin(link)
                        item['internal'] = url
                        # yield scrapy.Request(url, self.parse)
                    elif link.startswith("http") or link.startswith("ftp") or link.startswith("www"):
                        # external link
                        item['external'] = link
                    else:
                        # misc. links: mailto, id (#)
                        item['misc'] = link
                    items.append(item)
            return items
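What I think I actually want is to classify by the link's domain rather than by its prefix, roughly like this (an untested sketch; the classify_link helper, the OWN_DOMAIN constant and the use of urljoin/urlparse are only illustrative, not part of my spider above):

    # Rough sketch: decide internal vs. external by comparing domains,
    # so absolute links to our own site are still treated as internal.
    from urlparse import urlparse, urljoin  # Python 2; on Python 3: from urllib.parse import urlparse, urljoin

    OWN_DOMAIN = "my-domain.com"

    def classify_link(link, page_url):
        """Return 'internal', 'external' or 'misc' for a raw href value."""
        if link.startswith("mailto:") or link.startswith("#"):
            return 'misc'
        absolute = urljoin(page_url, link)        # '/about' -> 'http://www.my-domain.com/about'
        host = urlparse(absolute).netloc.lower()
        if host == OWN_DOMAIN or host.endswith("." + OWN_DOMAIN):
            return 'internal'                     # also catches 'http://www.my-domain.com/about'
        return 'external'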

Any suggestions?

2 answers

Use a link extractor.

When creating an instance, pass your allowed domain. You do not need to worry about specifying tags, since (according to the docs) the tags parameter defaults to ('a', 'area').

Using the Rust language website as an example, the code for printing all internal links from that domain would look like this:

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class RustSpider(scrapy.Spider):
        name = "rust"
        allowed_domains = ["www.rust-lang.org"]
        start_urls = (
            'http://www.rust-lang.org/',
        )

        def parse(self, response):
            extractor = LinkExtractor(allow_domains='rust-lang.org')
            links = extractor.extract_links(response)
            for link in links:
                print(link.url)

and the output will be a list of links such as https://doc.rust-lang.org/nightly/reference.html (I cannot post more), excluding all external links such as those to StackOverflow.

Please do not forget to check the documentation page, since the link extractor has many parameters that you may need.
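Since the question is about storing the external links, the same extractor can be turned around: deny_domains excludes your own domain, so whatever it extracts is external. A minimal sketch under that assumption (the spider name and the yielded dicts are only illustrative):

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class ExternalLinksSpider(scrapy.Spider):
        name = "external"
        allowed_domains = ["my-domain.com"]
        start_urls = ('http://www.my-domain.com/',)

        def parse(self, response):
            # Follow internal links so the whole site gets crawled.
            internal = LinkExtractor(allow_domains='my-domain.com')
            for link in internal.extract_links(response):
                yield scrapy.Request(link.url, callback=self.parse)

            # Collect everything that does not point to our own domain.
            external = LinkExtractor(deny_domains='my-domain.com')
            for link in external.extract_links(response):
                yield {'external': link.url}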


An if statement can take several or operators, not just two.
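For example (a hypothetical helper based on the spider from the question), the internal-link check could list the absolute forms of the site's own domain as well:

    # Hypothetical helper: one condition with several "or" operators that also
    # treats absolute URLs on the site's own domain as internal.
    def is_internal(link):
        return (link.startswith("/")
                or link.startswith(".")
                or link.startswith("http://www.my-domain.com")
                or link.startswith("https://www.my-domain.com")
                or link.startswith("www.my-domain.com"))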


Source: https://habr.com/ru/post/1232867/

