I am reading the contents of a webpage with BeautifulSoup and want to grab only the <a href> links that start with http://. I know BeautifulSoup lets you search by attribute, so I guess I just have a syntax problem. I assumed it would look something like this:
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.linkpages.com")
soup = BeautifulSoup(page)
for link in soup.findAll('a'):
    if link['href'].startswith('http://'):
        print link
But this returns:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
Any ideas? Thanks in advance.
EDIT: This is not for any site in particular; the script gets the URL from the user. Internal links would therefore be a problem, which is why I only want the <a> tags whose href starts with http://. If I point it at www.reddit.com, it parses the first few links and then gets to this:
<a href="http://www.reddit.com/top/">top</a>
<a href="http://www.reddit.com/saved/">saved</a>
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
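The output above suggests the loop works until it reaches an <a> tag that has no href attribute at all (a named anchor, for example), at which point link['href'] raises KeyError. Below is a minimal sketch of the guard, written for Python 3 with only the standard library's html.parser so that it runs without extra packages; it is not the BeautifulSoup code from the question. The class name AbsoluteLinkCollector and the SAMPLE_HTML markup are made up for illustration. With BeautifulSoup itself, the same idea would probably be link.get('href') in place of link['href'], or restricting the search with findAll('a', href=True).

```python
from html.parser import HTMLParser


class AbsoluteLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with http://."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs. An anchor such as
        # <a name="content"> has no "href" entry, so look it up with
        # .get() instead of [] to avoid a KeyError.
        href = dict(attrs).get("href") or ""
        if href.startswith("http://"):
            self.links.append(href)


# Hypothetical sample markup: two absolute links, one anchor without
# an href, and one relative (internal) link.
SAMPLE_HTML = """
<a href="http://www.reddit.com/top/">top</a>
<a name="content"></a>
<a href="/saved/">saved</a>
<a href="http://www.reddit.com/saved/">saved</a>
"""

collector = AbsoluteLinkCollector()
collector.feed(SAMPLE_HTML)
for url in collector.links:
    print(url)
```

The anchor without an href and the relative link are both skipped, so only the two http:// URLs are collected.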
Kevin