BeautifulSoup does not extract all html (automatically deleting most of the html page)

Question

I am trying to use BeautifulSoup to fetch content from a website (http://brooklynexposed.com/events/). As an example of the problem, I can run the following code:

    import urllib
    import bs4 as BeautifulSoup

    url = 'http://brooklynexposed.com/events/'
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup.BeautifulSoup(html)
    print soup.prettify().encode('utf-8')

The output appears to be truncated, ending as follows:

    <li class="event">
     9:00pm - 11:00pm
     <br/>
     <a href="http://brooklynexposed.com/events/entry/5432/2013-07-16">
      Comedy Sh
     </a>
    </li>
    </ul>
    </div>
    </div>
    </div>
    </div>
    </body>
    </html>

The output stops partway through the listing called "Comedy Show": everything that should follow it is missing, and only the final closing tags are emitted. Most of the HTML is dropped. I have noticed similar behavior on many sites: if the page is too long, BeautifulSoup seems unable to parse the entire page and simply cuts the text off. Does anyone have a solution for this? If BeautifulSoup cannot handle such pages, does anyone know of other libraries with a function like prettify()?

python urllib beautifulsoup

user2540231 Jul 15 '13 at 17:25

2 answers

Siva cn · Answer 1 · 2013-10-28T18:06:56+0000

It works fine for me, but I get an error when I call soup.prettify().encode('utf-8'):

    >>> from BeautifulSoup import BeautifulSoup as bs
    >>> import urllib
    >>> url = 'http://brooklynexposed.com/events/'
    >>> html = urllib.urlopen(url).read()
    >>> soup = bs(html)
    >>> soup.prettify().encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8788: ordinal not in range(128)
    >>> soup.prettify()
    '<!doctype html>\n<!--[if lt IE 7 ]&gt; &lt;html class="no-js ie6" lang="en"&gt; &lt;![endif]-->\n <!--[if IE 7 ]&gt; ... ... ... ... </body>\n</html>\n'

I think this can help you: BeautifulSoup, where do you put my HTML?

guettli · Answer 2 · 2016-04-08T13:08:56+0000

I had the problem that bs4 truncated the HTML on some machines but not on others. I could not reproduce it reliably.

I switched to this:

 soup = bs4.BeautifulSoup(html, 'html5lib')

... and now it works.
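One way to check whether the parser is what differs between machines is to name each parser explicitly and compare the results. The following is only a minimal sketch in Python 3, not the code from either answer: the sample markup and the helper name `count_links_per_parser` are made up for illustration, and `lxml` and `html5lib` are separate packages that are skipped here if they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately sloppy markup (not the page from the question):
# the second <a> and the <p> are never closed.
SAMPLE_HTML = """<html><body>
<ul>
<li class="event">9:00pm - 11:00pm<br/><a href="/events/1">Comedy Show</a></li>
<li class="event">10:00pm<br><a href="/events/2">Late Set</li>
</ul>
<p>Footer text
</body></html>"""


def count_links_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Parse html once per available parser and return {parser: number of <a> tags}."""
    counts = {}
    for name in parsers:
        try:
            # Passing the parser name avoids bs4 silently picking
            # whichever parser happens to be installed on this machine.
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser library is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("a"))
    return counts


if __name__ == "__main__":
    for parser_name, n_links in count_links_per_parser(SAMPLE_HTML).items():
        print(parser_name, n_links)
```

If the counts differ between parsers for a real page, that suggests the truncation comes from how one parser recovers from broken markup. When no parser is named, bs4 chooses the best one installed, which may explain why the same code behaves differently on different machines.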
