I know that KeyErrors are pretty common with BeautifulSoup, and before you shout RTFM at me: I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. That aside, I still don't understand what is happening with this KeyError.
Here's the program I'm trying to run, which consistently raises a KeyError on the last element of the URL list.
I come from a C++ background, just so you know, but I need to use BeautifulSoup for this work; doing it in C++ would be a nightmare!
The idea is to return a list of all the URLs on the website that contain links to a specific URL on their pages.
Here is what I got so far:
import urllib
from BeautifulSoup import BeautifulSoup

URLs = []
Locations = []

URLs.append("http://www.tuftsalumni.org")

def print_links(link):
    if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
        if (link.startswith('/')):
            link = "STARTING_WEBSITE" + link
        print(link)
        htmlSource = urllib.urlopen(link).read(200000)
        soup = BeautifulSoup(htmlSource)
        for item in soup.fetch('a'):
            if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
                URLs.append(item['href'])
                length = len(URLs)
            if (item['href'] == "SITE_ON_PAGE"):
                if (check_list(link, Locations) == "no"):
                    Locations.append(link)

def check_list(link, array):
    for x in range(0, len(array)):
        if (link == array[x]):
            return "yes"
    return "no"

print_links(URLs[0])
for x in range(0, len(URLs)):
    print_links(URLs[x])
The error is raised on the last element of the URL list:
File "scraper.py", line 35, in <module>
    print_links(URLs[x])
File "scraper.py", line 16, in print_links
    if (item['href'].startswith('/') or
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/BeautifulSoup.py", line 613, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
Now, I know that I need to use get() to handle the case where the attribute is missing, instead of indexing with item['href'] and getting a KeyError. But despite literally an hour of searching, I have no idea how to actually do this.
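For what it's worth, here is the pattern I believe applies, sketched with plain dicts standing in for BeautifulSoup Tag objects (Tag.get('href') behaves analogously to dict.get: it returns None rather than raising KeyError when the attribute is absent). The sample anchors are made up for illustration:

```python
# Plain dicts stand in for BeautifulSoup Tags here; a real Tag also
# supports .get(), returning None when the attribute is missing
# instead of raising KeyError like item['href'] does.
anchors = [
    {'href': '/alumni'},                               # normal relative link
    {'name': 'top'},                                   # <a name=...> with no href
    {'href': 'http://www.tuftsalumni.org/giving'},     # normal absolute link
]

links = []
for item in anchors:
    href = item.get('href')   # None if this <a> tag has no href attribute
    if href is None:
        continue              # skip nameless anchors instead of crashing
    if href.startswith('/') or "tuftsalumni" in href:
        links.append(href)

print(links)  # only the two anchors that actually have an href survive
```

The KeyError in the traceback comes from anchors like the second one: an `<a>` tag used as a named anchor has no `href` attribute at all, so `item['href']` blows up while `item.get('href')` quietly returns None.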
Thank you; if I can clarify anything, please let me know.