BeautifulSoup - how do I get body contents

I am parsing HTML with BeautifulSoup. In the end, I would like to get the contents of the body, but without the body tags themselves. However, BeautifulSoup adds html, head and body tags. A discussion on Google Groups proposed one possible solution:

>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n  Some paragraph\n </p>'

This solution is a hack. There must be a better, more obvious way to do this.
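
For context, whether the wrapper tags appear at all depends on which parser BeautifulSoup is given. Below is a minimal sketch, assuming the fragment from the question and Python's built-in html.parser. According to the parser comparison in the BeautifulSoup documentation, lxml and html5lib would instead wrap a bare fragment in html and body tags. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The fragment from the question, parsed with the built-in parser.
fragment = "<p>Some paragraph</p>"
soup = BeautifulSoup(fragment, "html.parser")

# html.parser does not invent <html>/<body> wrappers for a bare fragment.
has_body = soup.body is not None

# Serialising the soup therefore gives back just the fragment.
fragment_html = str(soup)
```

With this parser there is no body element to strip, so str(soup) is already the bare fragment.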

1 answer

Do you mean you want everything between the body tags?

In this case you can use:

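from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

page = urlopen('some_site').read()  # 'some_site' is a placeholder URL
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly
body = soup.find('body')
# decode_contents() renders everything inside <body> without the <body> tags
# themselves; findChildren() would return only a flat list of descendant tags.
the_contents_of_body_without_body_tags = body.decode_contents()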
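
To make the options concrete, here is a small sketch that works on an inline document rather than a downloaded page. The sample HTML and the variable names are made up for illustration; decode_contents(), contents and find_all() are regular members of bs4 Tag objects.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used instead of a real download.
html = (
    "<html><head><title>Demo</title></head>"
    "<body>Intro <p>Some <b>paragraph</b></p><p>Second</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Inner HTML of <body> as one string, without the <body> tags themselves.
inner_html = body.decode_contents()

# Direct children only: text nodes and tags, in document order.
direct_children = list(body.contents)

# Every descendant tag, flattened; nested tags show up as separate entries
# and bare text such as "Intro " is not included.
descendant_tag_names = [tag.name for tag in body.find_all(True)]
```

If a string is the goal, decode_contents() is probably the most direct choice; contents is handier when the children will be processed further as objects.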

Source: https://habr.com/ru/post/1524458/

