I am trying to use BeautifulSoup to fetch content from a website (http://brooklynexposed.com/events/). As an example of the problem, I can run the following code:
    import urllib
    import bs4 as BeautifulSoup

    url = 'http://brooklynexposed.com/events/'
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup.BeautifulSoup(html)
    print soup.prettify().encode('utf-8')
The output seems to cut off the HTML as follows:
<li class="event"> 9:00pm - 11:00pm <br/> <a href="http://brooklynexposed.com/events/entry/5432/2013-07-16"> Comedy Sh </a> </li> </ul> </div> </div> </div> </div> </body> </html>
It cuts off the listing called Comedy Show, along with all of the HTML that should appear before those final closing tags; most of the HTML is simply missing from the output. I have noticed similar behavior on many sites: if the page is too long, BeautifulSoup does not seem to parse the entire page and simply cuts off the text. Does anyone have a solution for this? If BeautifulSoup is not capable of handling such pages, does anyone know of other libraries with a function like prettify()?
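For reference, here is what I was planning to try next, in case it helps frame the question: explicitly telling BeautifulSoup which tree builder to use. I am not certain this is the fix, but truncation like this is often attributed to the underlying parser giving up on malformed markup, and html5lib is supposed to be the most lenient option (it has to be installed separately, e.g. pip install html5lib):

    import urllib
    import bs4 as BeautifulSoup

    url = 'http://brooklynexposed.com/events/'
    html = urllib.urlopen(url).read()

    # Request the html5lib tree builder explicitly instead of relying on the
    # default parser; html5lib tolerates broken markup better, so it may
    # avoid the truncation seen above.
    soup = BeautifulSoup.BeautifulSoup(html, 'html5lib')
    print soup.prettify().encode('utf-8')

Passing 'lxml' as the parser name is the other variant I would try if html5lib does not change the output.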