Trying to capture only absolute links from a web page using BeautifulSoup

I am reading the contents of a webpage using BeautifulSoup. I want to grab only the <a href> links that start with http:// . I know that in BeautifulSoup you can search by attributes, so I assume the solution is something similar; I think I just have a syntax problem.

    page = urllib2.urlopen("http://www.linkpages.com")
    soup = BeautifulSoup(page)
    for link in soup.findAll('a'):
        if link['href'].startswith('http://'):
            print link

But this returns:

    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
        return self._getAttrMap()[key]
    KeyError: 'href'

Any ideas? Thanks in advance.

EDIT: This is not for any site in particular; the script gets the URL from the user, so internal link targets will be a problem. I only want the <a> tags with real links from the pages. If I point it at www.reddit.com , it prints the first few links and then stops here:

    <a href="http://www.reddit.com/top/">top</a>
    <a href="http://www.reddit.com/saved/">saved</a>
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
        return self._getAttrMap()[key]
    KeyError: 'href'
4 answers
    from BeautifulSoup import BeautifulSoup
    import re
    import urllib2

    page = urllib2.urlopen("http://www.linkpages.com")
    soup = BeautifulSoup(page)
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        print link

Perhaps there are <a> tags without href attributes? Internal link targets, perhaps?
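That is exactly what the traceback suggests: an anchor such as <a name="top"> has no href, so indexing it raises KeyError. A minimal, network-free sketch of the safe pattern, using the standard-library html.parser instead of BeautifulSoup (the sample markup and class name are made up for illustration; with BeautifulSoup the equivalent is tag.get('href')):

```python
from html.parser import HTMLParser

# Hypothetical sample markup: an internal anchor with no href,
# a relative link, and one absolute link.
HTML = ('<a name="top"></a>'
        '<a href="/saved/">saved</a>'
        '<a href="http://example.com/">external</a>')

class LinkCollector(HTMLParser):
    """Collects only href values that start with http://."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # dict.get() returns None instead of raising KeyError
            # when the attribute is missing.
            href = dict(attrs).get('href')
            if href and href.startswith('http://'):
                self.links.append(href)

parser = LinkCollector()
parser.feed(HTML)
print(parser.links)  # only the absolute link survives
```

The key point carries over directly: look the attribute up with a method that tolerates its absence, rather than indexing unconditionally.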


Without giving us an idea of what you are parsing here, and as Andrey points out, it seems likely that there are some anchor tags without an href attribute.

If you really want to ignore them, you can wrap the attribute lookup in a try block and catch the error:

    except KeyError:
        pass

But this approach has its own problems.
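A minimal sketch of that try/except pattern, using plain dicts to stand in for tags (BeautifulSoup tags support the same tag['href'] indexing and raise KeyError the same way; the sample data is invented):

```python
# Simulated anchor tags as attribute dicts: one internal anchor
# with no href, one relative link, one absolute link.
tags = [
    {'name': 'top'},
    {'href': '/saved/'},
    {'href': 'http://www.reddit.com/top/'},
]

absolute = []
for tag in tags:
    try:
        href = tag['href']
    except KeyError:
        continue  # anchor tag with no href; skip it
    if href.startswith('http://'):
        absolute.append(href)

print(absolute)
```

One problem with this style is that the try block can silently mask other KeyErrors raised inside it, which is why an explicit membership check or a .get() lookup is usually preferred.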

    import urllib2
    from bs4 import BeautifulSoup

    f = open('Links.txt', 'w')
    url = 'http://www.reddit.com'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for item in soup.find_all('a'):
        for x in item.attrs:
            if x == 'href':
                f.write(item.attrs[x] + ',\n')
    f.close()

A less efficient solution; note that it writes every href it finds, not only the absolute links.


Source: https://habr.com/ru/post/1304912/
