Getting BeautifulSoup to search for a specific <p>
I am trying to build a basic HTML scraper for various scientific journal websites, specifically to extract the abstract or introductory paragraph.
The journal I'm currently working on is Nature, and the article I'm using as a sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html .
I cannot get the abstract out of this page. I am looking for everything between the <p class="lead">...</p> tags, but I cannot figure out how to isolate it. I thought it would be something simple, like
from BeautifulSoup import BeautifulSoup
import re
import urllib2
address="http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
abstract = soup.find('p', attrs={'class' : 'lead'})
print abstract
Using Python 2.5 and BeautifulSoup 3.0.8, this prints None. I cannot use anything else that needs to be compiled or installed (e.g. lxml). Is BeautifulSoup confused, or am I?
The HTML on that page is quite malformed: xml.dom.minidom cannot parse it, and BeautifulSoup does not parse it correctly either.
I removed the <!-- ... --> comment sections and parsed the result again with BeautifulSoup. That seems to work better, and soup.find('p', attrs={'class' : 'lead'}) now finds the paragraph.
Here is the code I tried:
>>> html =re.sub(re.compile("<!--.*?-->",re.DOTALL),"",html)
>>>
>>> soup=BeautifulSoup(html)
>>>
>>> soup.find('p', attrs={'class' : 'lead'})
<p class="lead">The class of exotic Jupiter-mass planets that orb .....
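The comment-stripping step above can be tried in isolation, without BeautifulSoup or a network connection. This is a minimal sketch: the sample markup is made up for illustration and is not the real Nature page, and strip_comments is a hypothetical helper name.

```python
import re

# Matches an HTML comment; re.DOTALL lets "." cross line breaks,
# and the non-greedy ".*?" stops at the first closing "-->".
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> block removed."""
    return COMMENT_RE.sub("", html)


# Hypothetical sample, not the actual page source.
sample = (
    '<div><!-- broken <p> markup\n'
    'spanning lines -->\n'
    '<p class="lead">Abstract text.</p></div>'
)

cleaned = strip_comments(sample)
print(cleaned)
```

The cleaned string can then be handed to BeautifulSoup as in the session above. Note that a regex like this treats anything between <!-- and --> as a comment, including conditional comments and commented-out scripts.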
Here is a non-BeautifulSoup way to get the abstract:
address="http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
for para in html.split("</p>"):
    if '<p class="lead">' in para:
        abstract=para.split('<p class="lead">')[1:][0]
        print ' '.join(abstract.split("\n"))
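The same string-splitting idea can be wrapped in a function so it is easy to test on a literal string. This is only a sketch under some assumptions: extract_lead is a hypothetical name, the sample markup is invented, and unlike the loop above it collapses all runs of whitespace rather than only replacing newlines.

```python
def extract_lead(html, marker='<p class="lead">'):
    """Return the text of the first paragraph that starts with marker.

    Returns None when the marker does not occur in html.
    """
    for chunk in html.split("</p>"):
        if marker in chunk:
            # Keep only what follows the opening tag.
            text = chunk.split(marker, 1)[1]
            # Collapse newlines and repeated spaces into single spaces.
            return " ".join(text.split())
    return None


# Hypothetical sample, not the actual page source.
sample = (
    '<p>Intro</p>'
    '<p class="lead">The class of\nexotic planets.</p>'
    '<p>More</p>'
)

result = extract_lead(sample)
print(result)
```

Any inline tags inside the paragraph (such as <i> or <sup>) are left in the returned text, and the match depends on the opening tag being written exactly as the marker string, so this approach is fragile if the page layout changes.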