BeautifulSoup sometimes gives exceptions

Question

Strangely, the BeautifulSoup object sometimes returns the data I need, but at other times I get an error such as "list index out of range" or "'NoneType' object has no attribute 'findNext'" when I access data that is nested inside other elements.

This is the code:

    url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    a = soup.find(text=('Socket')).find_next('dd').string
    print(a)

+5

python html-parsing web-crawler web-scraping beautifulsoup

user3660293 Dec 15 '14 at 15:30


3 answers

This means that, for some reason, the data returned by the store does not contain the items you are looking for.

Add proper error handling that catches the exception and dumps the input when the code breaks. That way you can see what was actually downloaded and improve the code.

First step:

    try:
        a = soup.find(text=('Socket')).find_next('dd').string
        print(a)
    except:
        print(plain_text)
        raise

If you have a lot of text, write it to a file.
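As a minimal sketch of that idea, the helpers below save the downloaded HTML to a file when extraction fails and then re-raise the original exception. The names `dump_html` and `run_with_dump` and the file name `failed_page.html` are made up for illustration; they are not part of requests or BeautifulSoup.

```python
import tempfile
from pathlib import Path


def dump_html(html, directory=None):
    """Save the HTML to a file and return its path (hypothetical helper)."""
    target_dir = Path(directory) if directory else Path(tempfile.gettempdir())
    path = target_dir / "failed_page.html"  # illustrative file name
    path.write_text(html, encoding="utf-8")
    return path


def run_with_dump(html, extract):
    """Call extract(html); on any error, save html for inspection and re-raise."""
    try:
        return extract(html)
    except Exception:
        saved = dump_html(html)
        print("Extraction failed; page saved to", saved)
        raise
```

Here `extract` would be whatever function performs the `soup.find(...)` lookup on the downloaded text.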

It is also risky to chain so many operations on one line. If something goes wrong, you will not know which step failed. Split it into several lines so you can quickly see whether it failed to find the Socket text, the dd element, etc.
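One way to split the chain, sketched here against made-up inline HTML rather than the real store page, is to check each lookup before moving on. The function name `find_socket` and the sample markup are assumptions for illustration; `string=` is the current name of the `text=` argument (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup


def find_socket(html):
    """Return the text of the <dd> after the 'Socket' label, or None."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: look for a text node that is exactly 'Socket'.
    label = soup.find(string="Socket")
    if label is None:
        print("no text node equal to 'Socket'")
        return None

    # Step 2: look for the next <dd> element after that label.
    value = label.find_next("dd")
    if value is None:
        print("'Socket' found, but no <dd> after it")
        return None

    return value.get_text(strip=True)


print(find_socket("<dl><dt>Socket</dt><dd>1150</dd></dl>"))
print(find_socket("<dl><dt>\tSocket </dt><dd>1150</dd></dl>"))
```

With this layout, the printed message tells you which of the two lookups came back empty instead of leaving you with a bare AttributeError.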

-1

Aaron digulla Dec 15 '14 at 15:44


I made the suggested change for your code:

    url = 'http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    label = soup.find(text=('Socket'))
    if label:
        a = label.find_next('dd').string
        print(a)
    else:
        # Display some error info, and/or do some error logging
        print("error")

-1

drsnark Dec 15 '14 at 15:49


alecxe · Accepted Answer · 2014-12-15T15:49:57+0000

The actual problem is that the cell text is not always exactly Socket; sometimes it is surrounded by tabs or spaces. Instead of checking for an exact text match, pass a compiled regex pattern:

    import re

    soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True)

This always prints 1150.
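To see the difference in isolation, here is a small sketch using made-up markup in which the label is padded with a tab and a space. The HTML string is an assumption for illustration, not the store's actual source; `string=` is the current name of the `text=` argument.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: the label is padded with whitespace, as described above.
html = "<dl><dt>\tSocket </dt><dd> 1150\n</dd></dl>"
soup = BeautifulSoup(html, "html.parser")

# An exact string match fails because the text node is '\tSocket ', not 'Socket'.
exact = soup.find(string="Socket")
print(exact)  # None

# A compiled pattern is matched with a search, so surrounding whitespace is fine.
fuzzy = soup.find(string=re.compile("Socket"))
socket_value = fuzzy.find_next("dd").get_text(strip=True)
print(socket_value)  # 1150
```

Chaining `.find_next('dd')` onto the exact match would raise the 'NoneType' AttributeError from the question, which is why the failure only shows up when the padded variant of the page is served.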

To explain the "sometimes" (thanks to @carpetsmoker for the original suggestion in the comments):

If you open the page, then clear the cookies and refresh, you can see two different views of the same page.

The blocks on the page are arranged differently, so the same page has two different layouts and two different HTML sources. What you are seeing is A/B testing:

In marketing and business intelligence, A/B testing is jargon for a randomized experiment with two variants, A and B, which are the control and treatment in the controlled experiment. It is a form of statistical hypothesis testing with two variants, leading to the technical term two-sample hypothesis testing, used in the field of statistics.

In other words, they are experimenting with the product page and collecting statistics such as click-through rate, number of sales made, etc.

FYI, here is the working code that I have at the moment:

    import re

    from bs4 import BeautifulSoup
    import requests

    session = requests.Session()

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
    session.get('http://www.computerstore.nl', headers=headers)

    response = session.get('http://www.computerstore.nl/product/470130/category-208983/asrock-z97-extreme6.html', headers=headers)

    soup = BeautifulSoup(response.content)
    print(soup.find(text=re.compile('Socket')).find_next('dd').get_text(strip=True))
