You cannot rely on BeautifulSoup, or any other HTML parser, to read arbitrary web pages, because you are never guaranteed that a web page is a well-formed document. Let me explain what happens in this case.
The page contains this inline JavaScript:
var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";
You can see that it builds a string containing a script tag, which is later written into the page. For an HTML parser, this is a very difficult situation: it is reading tokens when it suddenly hits a <script>. Now, unfortunately, if you did this:
<script> alert('hello'); <script> alert('goodby');
Most parsers will say: OK, I found an opening script tag. Oh, I found another opening script tag! They must have forgotten to close the first one! And the parser will treat both as valid scripts.
So, in this case, BeautifulSoup sees the <script> and, although it is inside a line of JavaScript, it looks like it could be a valid start tag, so BeautifulSoup chokes on it, as well it should.
If you look at the line again, you will see that it does this interesting bit of work:
... "</" + "script>";
Does that look odd? Wouldn't it be simpler to write str = " ... </script>" without the extra string concatenation? This is actually a common trick (used by people who write script tags as strings, which is bad practice) to keep the parser from breaking. Because if you do this:
var a = '</script>';
in an inline script, the parser will come along, see the </script>, and conclude that the whole script element has ended, dumping the rest of the script's contents onto the page as plain text. This is because a closing script tag is honored wherever it appears, even if that leaves your JS syntactically invalid. From the parser's point of view, it's better to exit the script tag early than to try to interpret your HTML as JavaScript.
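To see the early exit for yourself, here is a minimal sketch using Python's standard-library html.parser rather than BeautifulSoup itself (BeautifulSoup's behavior depends on which underlying parser it is configured with). The names ScriptCollector and split_page are made up for this illustration, and the two test pages are my own, not taken from the page in question.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: records text inside <script> vs. text outside it."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []  # text the parser believes is script source
        self.leaked = []   # text the parser believes is ordinary page text

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data
        elif data.strip():
            self.leaked.append(data)


def split_page(html):
    """Return (script bodies, text that ended up outside any script)."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts, "".join(parser.leaked)


# The closing tag written literally inside a JS string...
naive = "<script>var a = '</script>'; alert(a);</script>"
# ...and the same thing using the concatenation trick.
split = "<script>var a = '</' + 'script>'; alert(a);</script>"

for page in (naive, split):
    scripts, leaked = split_page(page)
    print("script bodies:", scripts, "| leaked text:", repr(leaked))
```

With the literal closing tag, the parser should cut the script off at var a = ' and report the remainder as page text; with the concatenation trick, the whole line stays inside the script.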
Thus, you cannot rely on a regular HTML parser to parse arbitrary web pages. This is a very, very dangerous game. There is no guarantee that you will get well-formed HTML. Depending on what you are trying to do, you can either pull out the content you need with a regular expression, or get the fully rendered content of the page using a headless browser.
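As a rough sketch of the regular-expression route, the snippet below pulls the src URLs of script tags straight out of the raw page text without building a document tree. The pattern and the script_sources name are my own assumptions; it is only an illustration and will miss unquoted attributes and other edge cases.

```python
import re

# Match <script ... src='...'> or <script ... src="...">, capturing the URL.
SRC_RE = re.compile(
    r"""<script\b[^>]*?\bsrc\s*=\s*(['"])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def script_sources(page_text):
    """Return every quoted src value found on a <script ...> tag."""
    return [match.group(2) for match in SRC_RE.finditer(page_text)]


# The inline JavaScript line quoted earlier, treated as plain text.
page = '''var str="<script src='http://widgets.outbrain.com/outbrainWidget.js'; type='text/javascript'></"+"script>";'''

print(script_sources(page))
```

Because the text is never tokenized as HTML, it makes no difference that the tag lives inside a JavaScript string. The flip side is that the regex cannot tell a real tag from one that is never written to the page.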