Trying to capture only absolute links from a web page using BeautifulSoup

I am reading the contents of a webpage using BeautifulSoup. I want to grab only the <a href> links that start with http:// . I know that in BeautifulSoup you can search by attributes, so I assume the solution is something similar; I think I just have a syntax problem.

    page = urllib2.urlopen("http://www.linkpages.com")
    soup = BeautifulSoup(page)
    for link in soup.findAll('a'):
        if link['href'].startswith('http://'):
            print link

But this returns:

    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
        return self._getAttrMap()[key]
    KeyError: 'href'

Any ideas? Thanks in advance.

EDIT: This is not for any site in particular; the script gets the URL from the user, so internal link targets will be a problem. I only want the <a> tags with real links from the pages. If I point it at www.reddit.com , it prints the first few links and then stops here:

    <a href="http://www.reddit.com/top/">top</a>
    <a href="http://www.reddit.com/saved/">saved</a>
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
        return self._getAttrMap()[key]
    KeyError: 'href'
4 answers
    from BeautifulSoup import BeautifulSoup
    import re
    import urllib2

    page = urllib2.urlopen("http://www.linkpages.com")
    soup = BeautifulSoup(page)
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        print link

Perhaps there are <a> tags without href attributes? Internal link targets, perhaps?
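That is exactly what the traceback suggests: an anchor such as <a name="top"> has no href, so indexing it raises KeyError. A minimal, network-free sketch of the safe pattern, using the standard-library html.parser instead of BeautifulSoup (the sample markup and class name are made up for illustration; with BeautifulSoup the equivalent is tag.get('href')):

```python
from html.parser import HTMLParser

# Hypothetical sample markup: an internal anchor with no href,
# a relative link, and one absolute link.
HTML = ('<a name="top"></a>'
        '<a href="/saved/">saved</a>'
        '<a href="http://example.com/">external</a>')

class LinkCollector(HTMLParser):
    """Collects only href values that start with http://."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # dict.get() returns None instead of raising KeyError
            # when the attribute is missing.
            href = dict(attrs).get('href')
            if href and href.startswith('http://'):
                self.links.append(href)

parser = LinkCollector()
parser.feed(HTML)
print(parser.links)  # only the absolute link survives
```

The key point carries over directly: look the attribute up with a method that tolerates its absence, rather than indexing unconditionally.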


Without giving us an idea of what you are parsing here, and as Andrey points out, it seems likely that there are some anchor tags without an href attribute.

If you really want to ignore them, you can wrap the attribute lookup in a try block and catch the error:

    except KeyError:
        pass

But this approach has its own problems.
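A minimal sketch of that try/except pattern, using plain dicts to stand in for tags (BeautifulSoup tags support the same tag['href'] indexing and raise KeyError the same way; the sample data is invented):

```python
# Simulated anchor tags as attribute dicts: one internal anchor
# with no href, one relative link, one absolute link.
tags = [
    {'name': 'top'},
    {'href': '/saved/'},
    {'href': 'http://www.reddit.com/top/'},
]

absolute = []
for tag in tags:
    try:
        href = tag['href']
    except KeyError:
        continue  # anchor tag with no href; skip it
    if href.startswith('http://'):
        absolute.append(href)

print(absolute)
```

One problem with this style is that the try block can silently mask other KeyErrors raised inside it, which is why an explicit membership check or a .get() lookup is usually preferred.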

    import urllib2
    from bs4 import BeautifulSoup

    f = open('Links.txt', 'w')
    url = 'http://www.reddit.com'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for item in soup.find_all('a'):
        for x in item.attrs:
            if x == 'href':
                f.write(item.attrs[x] + ',\n')
    f.close()

A less efficient solution; note that it writes every href it finds, not only the absolute links.


Source: https://habr.com/ru/post/1304912/
