BeautifulSoup can't parse a webpage?

I am using Beautiful Soup to parse a web page. I've heard that it is well known and good, but it isn't working for me.

Here is what I did:

 import urllib2
 from bs4 import BeautifulSoup

 page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
 soup = BeautifulSoup(page)
 print soup.prettify()

I think this is pretty simple. I open the web page and hand it to BeautifulSoup. But here is what I got:

Warning (from warnings module):

File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149

"Python built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))

...

HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94

I would have thought the CNN site was well designed, so I'm not sure what is going on. Does anyone have an idea?

+4
4 answers

From the docs:

If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib: Python's built-in HTML parser is simply not very good in older versions.

Your code works as is (tested on Python 2.7, Python 3.3) once you install a more robust parser such as lxml or html5lib:

 try:
     from urllib2 import urlopen
 except ImportError:
     from urllib.request import urlopen  # Python 3

 from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

 url = "http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1"
 soup = BeautifulSoup(urlopen(url))
 print(soup.prettify())

The CPython change "HTMLParser.py: more robust SCRIPT tag parsing" may be related, which is why newer Python versions handle this page.
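To make sure BeautifulSoup actually uses the parser you installed, you can name it explicitly as the second argument. A minimal sketch (the markup here is just a stand-in, not the CNN page):

```python
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

html = "<html><body><p>Hello, <b>world</b>!</p></body></html>"

# Name the parser explicitly instead of relying on the default choice.
# "html.parser" is the stdlib one; swap in "lxml" or "html5lib" if installed.
soup = BeautifulSoup(html, "html.parser")
print(soup.b.get_text())  # world
```

If the parser you name is not installed, BeautifulSoup raises FeatureNotFound, which makes the dependency explicit instead of silently falling back.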

+10

You can't rely on BeautifulSoup, or any HTML parser, to read arbitrary web pages: you are never guaranteed that a web page is a well-formed document. Let me explain what happens in this case.

This page contains this inline JavaScript:

 var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>"; 

You can see that it builds a string which injects a script tag into the page. For an HTML parser, this is tricky: you are reading along, consuming tokens, when you suddenly hit a <script>. Unfortunately, if you did this:

 <script> alert('hello'); <script> alert('goodby'); 

most parsers would say: OK, I found an open script tag. Oh, I found another open script tag! They must have forgotten to close the first one! And the parser would treat both as valid scripts.

So, in this case, BeautifulSoup sees a <script> tag, and even though it is inside a JavaScript string, it looks like it could be a valid start tag, so BeautifulSoup chokes on it, just as a browser's parser would.

If you look at the string again, you will see that they are doing something interesting:

 ... "</" + "script>"; 

Does that look weird? Wouldn't it be simpler to write str = " ... </script>" without the extra string concatenation? This is actually a common trick (used by people who, for better or worse, write script tags as strings) to keep the parser from breaking. Because if you do this:

 var a = '</script>'; 

in an inline script, the parser will come along, see only the </script>, decide that the whole script element has ended, and dump the rest of that script's contents onto the page as plain text. That is because you can technically put the closing script tag anywhere, even where your JS syntax is invalid. From the parser's point of view, it is better to bail out of the script element early than to try to interpret your JavaScript as HTML.
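A quick way to see this behavior is Python's built-in HTMLParser, the same parser BeautifulSoup was falling back on here. A minimal sketch (the ScriptWatcher class and the snippets are made up for illustration):

```python
from html.parser import HTMLParser  # Python 3 module name

class ScriptWatcher(HTMLParser):
    """Record every run of character data the parser hands us."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# A closing tag inside a JS string still ends the script element:
# the script's data stops right before the embedded </script>, and the
# rest of the statement falls out as ordinary page text.
naive = ScriptWatcher()
naive.feed("<script>var s = '</script>'; alert(s);</script>")
print(naive.chunks)   # ["var s = '", "'; alert(s);"]

# The concatenation trick hides the closing tag from the parser,
# so the whole statement survives as one chunk of script data.
tricky = ScriptWatcher()
tricky.feed("<script>var s = '</' + 'script>'; alert(s);</script>")
print(tricky.chunks)  # ["var s = '</' + 'script>'; alert(s);"]
```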

Thus, you cannot blindly use a regular HTML parser to parse web pages. It is a very, very risky game: there is no guarantee you will get well-formed HTML. Depending on what you are trying to do, you could read the content you need with a regular expression, or fetch the fully rendered page with a headless browser.

+7

You need to use the html5lib parser with BeautifulSoup.

To install the required parser, use pip:

 pip install html5lib 

then pass the parser name to BeautifulSoup like this:

 import mechanize
 from bs4 import BeautifulSoup

 br = mechanize.Browser()
 html = br.open("http://google.com/", timeout=100).read()
 soup = BeautifulSoup(html, 'html5lib')
 for a in soup.find_all('a'):
     print a['href']
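If installing mechanize isn't an option, the link-extraction part can be sketched with only the standard library (Python 3 names; the LinkCollector class is made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<p><a href="/a">one</a> <a href="/b">two</a> <a name="x">no href</a></p>')
print(collector.links)  # ['/a', '/b']
```

Note this inherits the stdlib parser's fragility on malformed pages discussed above, so for real scraping html5lib is still the safer choice.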
+2

One of the simplest things you can do is specify the parser as "lxml". You do this by passing "lxml" as the second argument to BeautifulSoup(), not to urlopen():

 soup = BeautifulSoup(page, "lxml")

Then your code will look like this:

 import urllib2
 from bs4 import BeautifulSoup

 page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
 soup = BeautifulSoup(page, "lxml")
 print soup.prettify()

So far I have not received any problems from this approach :)

+1

Source: https://habr.com/ru/post/1439698/

