BeautifulSoup - how do I get body contents

Question

I am parsing HTML with BeautifulSoup. In the end, I would like to get the contents of the body, but without the body tags themselves. However, BeautifulSoup adds html, head, and body tags. A Google Groups discussion proposed one possible solution:

>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n  Some paragraph\n </p>'

This solution is a hack. There must be a better, more obvious way to do this.

+4

python django beautifulsoup html5lib

Philip zedler Jan 30 '14 at 9:44

1 answer

Azwr · Accepted Answer · 2014-01-30T10:02:01+0000

Do you mean that you want to get everything between the body tags?

In that case you can use:

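from urllib.request import urlopen  # Python 3; the original answer used urllib2 (Python 2)
from bs4 import BeautifulSoup

page = urlopen('some_site').read()  # 'some_site' is a placeholder URL
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly
body = soup.find('body')
# findChildren() returns a list of all descendant Tag objects; text nodes are not included
the_contents_of_body_without_body_tags = body.findChildren()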
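
Note that findChildren() gives a list of Tag objects rather than markup, and it leaves out bare text nodes. If what is wanted is the inner HTML of the body as a single string, one possible alternative is the decode_contents() method that bs4 provides on tags. The following is only a sketch: the helper name body_inner_html and the sample markup are made up for illustration, and it assumes the built-in html.parser, which does not add html/body wrappers to fragments (hence the fallback to the whole soup).

```python
from bs4 import BeautifulSoup


def body_inner_html(markup, parser="html.parser"):
    """Return the markup inside <body>, without the <body> tags themselves.

    body_inner_html is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(markup, parser)
    # html.parser does not wrap fragments in <html>/<body>, so fall back to
    # the whole soup when the input has no <body> of its own.
    container = soup.body if soup.body is not None else soup
    # decode_contents() renders the children of a tag, not the tag itself.
    return container.decode_contents()


# Illustrative sample document (not taken from the question).
html = (
    "<html><head><title>t</title></head>"
    "<body><p>Some <b>bold</b> paragraph</p>tail text</body></html>"
)
inner = body_inner_html(html)
print(inner)
```

Unlike the hidden = True trick from the question, this does not modify the parsed tree, and it keeps text that sits directly inside the body.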
