Parsing HTML with BeautifulSoup 4 and Python

Question

I am trying to parse the search result list from http://mobile.de.

At first I tried the HTMLParser class, but got an error: HTMLParser.HTMLParseError: EOF in middle of construct.

So I tried BeautifulSoup 4, which is better suited to invalid markup, but the <div> I am searching for is not found, and I can't tell whether the mistake is mine or the website's.

    from bs4 import BeautifulSoup
    import urllib
    import socket

    searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
    f = urllib.urlopen(searchurl)
    html = f.read()
    soup = BeautifulSoup(html)

    for link in soup.find_all("div", "listEntry "):
        print link

listEntry is the <div> that holds each car result. But it seems that BeautifulSoup does not parse <form id="parkAndCompareVehicle" name="parkAndCompareVehicle" action="">. I cannot find the form in the soup object.

Where is the mistake?

python html html-parsing beautifulsoup

user1010775 Mar 30 '12 at 8:17

1 answer

gorlum0 · Accepted Answer · 2012-03-30T08:26:51+0000

It should be something like:

    for link in soup.findAll('div', {'class': 'listEntry '}):
        print link

Attributes are passed as a dictionary; the signature is findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs).
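To illustrate the point about attrs, here is a minimal sketch comparing three ways of filtering by class in bs4. The markup is made up for the example and is not taken from the real page, and the class name is written without the trailing space because bs4 treats class as a list of whitespace-separated tokens. It is written for Python 3, so print is a function here.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real result page (an assumption).
html = """
<div class="listEntry">Car A</div>
<div class="listEntry highlighted">Car B</div>
<div class="other">Not a result</div>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. A plain string as the second argument is treated as a CSS class.
by_string = soup.find_all("div", "listEntry")

# 2. An explicit attribute dictionary, as in the answer above.
by_dict = soup.find_all("div", {"class": "listEntry"})

# 3. The class_ keyword argument (class itself is a reserved word in Python).
by_keyword = soup.find_all("div", class_="listEntry")

for tag in by_dict:
    print(tag.get_text())
```

In this sketch all three calls select the same two elements, so which form to use is mostly a matter of taste.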

===========

Update: sorry, it seems that in bs4 your original form (passing the class as a plain string) works too.

As for the error: the form you are looking for does not show up in the results because, as far as I can see, it wraps the listEntry elements rather than being one of them.

What is wrong with this:

    form = soup.find('form', id='parkAndCompareVehicle')
    print len(form.find_all('div', 'listEntry '))
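The same idea as a self-contained sketch, assuming the form wraps the result entries as described above. The HTML is a made-up stand-in, not the real page, and the None check is an addition for the case where the form is missing. It is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page structure: the form encloses the entries.
html = """
<form id="parkAndCompareVehicle" name="parkAndCompareVehicle" action="">
  <div class="listEntry">Car A</div>
  <div class="listEntry">Car B</div>
</form>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching for divs returns only divs; the enclosing form is never among them.
entries = soup.find_all("div", "listEntry")

# Look the form up directly by id, then search inside it.
form = soup.find("form", id="parkAndCompareVehicle")
inside = form.find_all("div", "listEntry") if form is not None else []

print(len(entries), len(inside))
```

If form comes back as None on the real page, the parser most likely dropped or restructured the element, and trying a different parser backend would be a reasonable next step.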
