How to find backlinks on a website using Python

I'm stuck: I want to find the backlinks on a site, but I can't figure out how to do it. Here is my code with the regular expression:

readh = BeautifulSoup(urllib.urlopen("http://www.google.com/").read()).findAll("a",href=re.compile("^http")) 

To find backlinks, I want the links that start with http but do not include google, and I can't figure out how to do this.

2 answers
    from BeautifulSoup import BeautifulSoup
    import re

    html = """
    <div>hello</div>
    <a href="/index.html">Not this one</a>"
    <a href="http://google.com">Link 1</a>
    <a href="http:/amazon.com">Link 2</a>
    """

    def processor(tag):
        href = tag.get('href')
        if not href: return False
        return True if (href.find("google") == -1) else False

    soup = BeautifulSoup(html)
    back_links = soup.findAll(processor, href=re.compile(r"^http"))
    print back_links

    --output:--
    [<a href="http:/amazon.com">Link 2</a>]

However, it may be more efficient to simply get all the links starting with http, and then filter out the ones that have "google" in their hrefs:

    http_links = soup.findAll('a', href=re.compile(r"^http"))
    results = [a for a in http_links if a['href'].find('google') == -1]
    print results

    --output:--
    [<a href="http:/amazon.com">Link 2</a>]
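
The snippets above are written for Python 2 and BeautifulSoup 3. For comparison, here is a minimal sketch of the same two-step filter using only the Python 3 standard library. The names `LinkCollector` and `external_links` are made up for this illustration, and checking the hostname (rather than the whole href) for "google" is an assumption about what the question intends; a plain substring test on the href would also drop links such as `http://example.com/google-review`.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, excluded="google"):
    """Return absolute http(s) hrefs whose host does not contain `excluded`."""
    collector = LinkCollector()
    collector.feed(html)
    results = []
    for href in collector.hrefs:
        parts = urlparse(href)
        if parts.scheme not in ("http", "https"):
            continue  # skips relative links such as /index.html
        if excluded in parts.netloc.lower():
            continue  # skips hosts such as google.com (assumed intent)
        results.append(href)
    return results


html = """
<div>hello</div>
<a href="/index.html">Not this one</a>
<a href="http://google.com">Link 1</a>
<a href="http://amazon.com">Link 2</a>
"""

print(external_links(html))
```

With the sample markup this prints `['http://amazon.com']`. Unlike the `^http` regex, the scheme check also accepts `https` links explicitly, which may or may not be what you want.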

Here is a regex that matches http URLs that do not contain "google":

 re.compile("(?!.*google)^http://(www.)?.*") 
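
To see what this pattern accepts and rejects, here is a small check in Python 3 against a handful of sample URLs (the URLs are invented for the illustration). The negative lookahead `(?!.*google)` rejects any string that contains "google" anywhere, including in the path or query string, and the rest of the pattern requires a literal `http://` prefix, so `https` links are not matched.

```python
import re

# Pattern from the answer above: the lookahead fails if "google" appears
# anywhere in the string; the remainder requires an http:// prefix.
pattern = re.compile("(?!.*google)^http://(www.)?.*")

urls = [
    "http://www.amazon.com/",
    "http://google.com/",
    "http://www.google.com/search",
    "https://example.com/",  # not matched: the pattern only accepts http://
    "/index.html",           # not matched: relative link
]

matches = [u for u in urls if pattern.match(u)]
print(matches)
```

This prints `['http://www.amazon.com/']`. Since BeautifulSoup accepts a compiled regex as an attribute filter, as in the question's own code, the pattern should be usable as the `href` argument to `findAll` in place of `re.compile("^http")`.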

Source: https://habr.com/ru/post/1497044/

