Getting BeautifulSoup to search for a specific <p>

I am trying to build a basic HTML scraper for various scientific journal websites, specifically to extract the abstract or introductory paragraph.

The journal I'm currently working on is Nature, and the article I'm using as a model can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html .

I cannot get the abstract from this page. I am looking for everything between the <p class="lead">...</p> tags, but I cannot figure out how to isolate it. I thought it would be something simple, like:

from BeautifulSoup import BeautifulSoup
import re
import urllib2

address="http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)

abstract = soup.find('p', attrs={'class' : 'lead'})
print abstract

With Python 2.5 and BeautifulSoup 3.0.8, this prints None. I cannot use anything that needs to be compiled or installed (e.g. lxml). Is BeautifulSoup confused, or am I?

2 answers

This HTML is quite malformed: xml.dom.minidom cannot parse it, and BeautifulSoup does not parse it correctly either.

I removed the <!-- ... --> comment sections and parsed the result again with BeautifulSoup. That works better, and soup.find('p', attrs={'class' : 'lead'}) now finds the paragraph.

Here is the code I tried:

>>> html = re.sub(re.compile("<!--.*?-->", re.DOTALL), "", html)
>>>
>>> soup = BeautifulSoup(html)
>>>
>>> soup.find('p', attrs={'class' : 'lead'})
<p class="lead">The class of exotic Jupiter-mass planets that orb  .....
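The comment-stripping step above can be wrapped in a small helper, which makes it easy to try on a test string before fetching the real page. This is only a sketch: the name strip_comments and the sample markup are made up for illustration, and a regex like this may also remove comment-like text inside <script> blocks.

```python
import re

# Non-greedy match for one HTML comment; DOTALL lets "." span newlines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> section removed."""
    return COMMENT_RE.sub("", html)


# Hypothetical sample markup, not the real Nature page.
sample = '<div><!-- ad\nblock --><p class="lead">Abstract text</p><!-- end --></div>'
cleaned = strip_comments(sample)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup exactly as in the session above.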

Here is a non-BeautifulSoup way to get the abstract:

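import urllib2

address = "http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
html = urllib2.urlopen(address).read()
# Split on closing tags so each chunk holds at most one paragraph.
for para in html.split("</p>"):
    if '<p class="lead">' in para:
        # Keep only the text after the opening tag.
        abstract = para.split('<p class="lead">')[1]
        # Join the lines of the abstract into a single line.
        print ' '.join(abstract.split("\n"))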
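The same split-based idea can be written as a function that takes the HTML as a string, so it can be checked without network access. This is a sketch under a few assumptions: extract_lead and the sample string are hypothetical, the opening tag must match the marker exactly (same attribute order and quoting), and any inline tags inside the paragraph are left in the result. Unlike the loop above, it collapses all runs of whitespace rather than only newlines.

```python
def extract_lead(html, marker='<p class="lead">'):
    """Return the text of the first paragraph opened by marker, or None."""
    # Splitting on the closing tag leaves at most one paragraph per chunk.
    for chunk in html.split("</p>"):
        if marker in chunk:
            # Keep what follows the opening tag.
            text = chunk.split(marker, 1)[1]
            # Collapse newlines and repeated spaces into single spaces.
            return " ".join(text.split())
    return None


# Hypothetical sample markup, not the real Nature page.
sample = '<p>Intro</p>\n<p class="lead">Hot Jupiters\norbit close in.</p><p>More</p>'
print(extract_lead(sample))
```

Returning None when nothing matches mirrors what soup.find does, so the caller can handle a missing abstract the same way in both approaches.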

Source: https://habr.com/ru/post/1738616/

