Problem with BeautifulSoup KeyError

I know that KeyErrors are pretty common with BeautifulSoup, and before you shout RTFM at me: I have read both the Python documentation and the BeautifulSoup documentation extensively. Even so, I still don't understand what is causing this KeyError.

Here's the program I'm trying to run, which consistently raises a KeyError on the last element of the URL list.

I come from a C++ background, just so you know, but I need to use BeautifulSoup for this job; doing it in C++ would be a nightmare!

The idea is to return a list of all the URLs on the website that contain links to a specific URL on their pages.

Here is what I got so far:

    import urllib
    from BeautifulSoup import BeautifulSoup

    URLs = []
    Locations = []
    URLs.append("http://www.tuftsalumni.org")

    def print_links(link):
        if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
            if link.startswith('/'):
                link = "STARTING_WEBSITE" + link
            print(link)
            htmlSource = urllib.urlopen(link).read(200000)
            soup = BeautifulSoup(htmlSource)
            for item in soup.fetch('a'):
                if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
                    URLs.append(item['href'])
                    length = len(URLs)
                if item['href'] == "SITE_ON_PAGE":
                    if check_list(link, Locations) == "no":
                        Locations.append(link)

    def check_list(link, array):
        for x in range(0, len(array)):
            if link == array[x]:
                return "yes"
        return "no"

    print_links(URLs[0])
    for x in range(0, len(URLs)):
        print_links(URLs[x])

The error I am getting, raised near the last element of the URL list, is:

      File "scraper.py", line 35, in <module>
        print_links(URLs[x])
      File "scraper.py", line 16, in print_links
        if (item['href'].startswith('/') or
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/BeautifulSoup.py", line 613, in __getitem__
        return self._getAttrMap()[key]
    KeyError: 'href'

Now I know that I need to use get() to supply a default and avoid the KeyError, but despite literally an hour of searching I have no idea how to do this.

Thank you. If I can clarify anything, please let me know.

1 answer

If you just want to handle the error, you can catch the exception:

    for item in soup.fetch('a'):
        try:
            if (item['href'].startswith('/') or
                    "tuftsalumni" in item['href']):
                (...)
        except KeyError:
            pass  # or some other fallback action
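To see why this works, the same pattern can be tried standalone, with plain dicts standing in for tags (the data here is made up for illustration; tag attributes behave like dictionary keys):

```python
# The second "tag" is a named anchor with no href attribute, so
# item['href'] raises KeyError for it, exactly as in the traceback.
anchors = [{'href': '/about'}, {'name': 'top'}]

found = []
for item in anchors:
    try:
        if item['href'].startswith('/') or "tuftsalumni" in item['href']:
            found.append(item['href'])
    except KeyError:
        pass  # tag had no href attribute; skip it

print(found)  # ['/about']
```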

You can also supply a default value with item.get('key', 'default'), but I don't think you need it in this case.
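The get() approach works because a tag's attributes act like a dictionary, and tags without an href (such as a named anchor) simply have no 'href' key. A minimal sketch with plain dicts (hypothetical data, not the actual site):

```python
# dict.get() returns a default instead of raising KeyError.
anchor_with_href = {'href': '/alumni/events'}
anchor_without_href = {'name': 'top'}  # named anchor, no href

print(anchor_with_href.get('href', ''))     # '/alumni/events'
print(anchor_without_href.get('href', ''))  # '' instead of a KeyError

# With the '' default, the startswith() test never crashes:
for item in (anchor_with_href, anchor_without_href):
    href = item.get('href', '')
    if href.startswith('/') or "tuftsalumni" in href:
        print(href)
```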

Edit: if all else fails, this is the barebone version, which should be a reasonable starting point:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import urllib
    from BeautifulSoup import BeautifulSoup

    links = ["http://www.tuftsalumni.org"]

    def print_hrefs(link):
        htmlSource = urllib.urlopen(link).read()
        soup = BeautifulSoup(htmlSource)
        for item in soup.fetch('a'):
            print item['href']

    for link in links:
        print_hrefs(link)

In addition, check_list(link, array) can be replaced with Python's built-in membership test: link in array.
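For example (with hypothetical values), the whole helper collapses to one line:

```python
Locations = ["http://www.tuftsalumni.org"]
link = "http://www.tuftsalumni.org/alumni"

# Instead of: if check_list(link, Locations) == "no":
if link not in Locations:
    Locations.append(link)
```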


Source: https://habr.com/ru/post/1400310/
