How to find backlinks on a website using python

Question


I'm stuck on this: I want to find the backlinks on a site, but I can't figure out how to do it. Here is my code so far:

readh = BeautifulSoup(urllib.urlopen("http://www.google.com/").read()).findAll("a",href=re.compile("^http"))

To find backlinks, I want the links that start with http but do not include google, and I can't figure out how to express that.

+4

python regex beautifulsoup

user2682790 Aug 14 '13 at 14:19


2 answers

7stud · Answer 1 · 2013-08-14T14:50:56+0000

    from BeautifulSoup import BeautifulSoup
    import re

    html = """
    <div>hello</div>
    <a href="/index.html">Not this one</a>"
    <a href="http://google.com">Link 1</a>
    <a href="http:/amazon.com">Link 2</a>
    """

    def processor(tag):
        href = tag.get('href')
        if not href: return False
        return True if (href.find("google") == -1) else False

    soup = BeautifulSoup(html)
    back_links = soup.findAll(processor, href=re.compile(r"^http"))
    print back_links

    --output:--
    [<a href="http:/amazon.com">Link 2</a>]

However, it may be more efficient to simply get all the links starting with http, and then filter them down to the ones that don't have "google" in their hrefs:

    http_links = soup.findAll('a', href=re.compile(r"^http"))
    results = [a for a in http_links if a['href'].find('google') == -1]
    print results

    --output:--
    [<a href="http:/amazon.com">Link 2</a>]
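A substring test on the whole href also throws away links that only mention "google" in their path or query string. As a rough sketch of a stricter filter, the version below checks only the host name. It is not the code from this answer: it assumes Python 3 and swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages. The names LinkCollector and external_links are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, excluded="google"):
    """Return absolute http(s) links whose host name does not contain `excluded`."""
    collector = LinkCollector()
    collector.feed(html)
    results = []
    for href in collector.hrefs:
        parts = urlparse(href)
        # Skip relative links and non-web schemes such as mailto:
        if parts.scheme not in ("http", "https"):
            continue
        # hostname is lowercased by urlparse; it is None for malformed URLs
        host = parts.hostname or ""
        if excluded in host:
            continue
        results.append(href)
    return results


html = """
<div>hello</div>
<a href="/index.html">Not this one</a>
<a href="http://google.com">Link 1</a>
<a href="http://amazon.com/search?q=google">Link 2</a>
"""

print(external_links(html))
```

Here "Link 2" is kept even though its query string contains "google", because only the host name is checked. Whether that is the right behaviour depends on what you count as a backlink.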

vegi · Answer 2 · 2013-08-14T14:53:17+0000

Here is a regex that matches http:// URLs but rejects any that contain "google":

    re.compile(r"^(?!.*google)http://(www\.)?.*")
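To see what the negative lookahead does, here is a small check against a few sample URLs. This is only a sketch: the URLs are made up, and the pattern is the same idea as above, written as a raw string with the dot after www escaped.

```python
import re

# Same idea as the pattern above: reject anything containing "google",
# then require the string to start with http://
pattern = re.compile(r"^(?!.*google)http://(www\.)?.*")

# Sample URLs, invented for illustration
urls = [
    "http://google.com",
    "http://www.amazon.com",
    "http://example.com/?ref=google",
    "/index.html",
]

matches = [u for u in urls if pattern.match(u)]
print(matches)
```

Note that the lookahead rejects "google" anywhere in the string, so the third URL is dropped even though its host is example.com. The pattern also does not match https:// links.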
