Save HTML as text

I have JavaScript code that just shows the source code of an HTML page:

javascript:h=document.getElementsByTagName('html')[0].innerHTML;function%20disp(h){h=h.replace(/</g,%20'\n&lt;');h=h.replace(/>/g,'&gt;');document.getElementsByTagName('body')[0].innerHTML='<pre>&lt;html&gt;'+h.replace(/(\n|\r)+/g,'\n')+'&lt;/html&gt;</pre>';}void(disp(h)); 

I saved the code as a bookmark in Firefox. After a web page loads, I select the bookmark and it shows the page's source code.

Now I am trying to save an HTML file using Python:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    page = urllib2.urlopen("http://www.doctorisin.net/")
    soup = BeautifulSoup(page)
    print soup.prettify()
    fp = open('file.txt', 'wb')
    fp.write(soup.prettify())

But the output does not have all the content that the JavaScript code shows: the saved file and the source displayed by the JavaScript do not match. Maybe the Python code is not getting everything (JavaScript and CSS tags) from the HTML page. What is the problem? Am I doing something wrong? I need help.

Thank you

EDITED

As an example of my problem, take http://phpjunkyard.com/tutorials/cut-paste-code.php (a random site). Go to this site, right-click and choose View Page Source (Firefox), then copy the source code and save it in a text file. Now save the page itself (Save Page As). You can see that the two are not the same: the saved page (Save Page As) contains more. Python gives the same output as View Page Source, which lacks some scripts, forms, etc.

2 answers

If you want to preserve the exact HTML code that the web server provides, do not use BeautifulSoup (it is an HTML parser and will most likely change the code when it prints it back); this would be a better solution:

    import urllib2
    file("my_file.txt", "w").write(urllib2.urlopen("http://www.doctorisin.net/").read())

By default, Firefox saves not only the HTML but also the files needed to render the page (including CSS and scripts).
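
On Python 3, where `urllib2` became `urllib.request`, the same idea might look roughly like the sketch below. The function name `save_raw_html` is made up for illustration; the point is only that the response bytes go to disk without passing through a parser.

```python
import urllib.request


def save_raw_html(url, path):
    """Write the response body to `path` byte for byte, without parsing it.

    `save_raw_html` is a hypothetical helper name, not part of any library.
    Returns the number of bytes written.
    """
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes exactly as the server sent them
    # Binary mode avoids any newline or encoding translation on write.
    with open(path, "wb") as out:
        out.write(raw)
    return len(raw)
```

Calling `save_raw_html("http://www.doctorisin.net/", "my_file.txt")` should then give a file that matches what View Page Source shows, since nothing has re-serialized the markup.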


What you see is the difference between static and dynamic web pages.

Unlike static web pages, dynamic web pages can change the base HTML when they load. JavaScript can dump the full HTML of the loaded page, since it has access to the modified DOM built by the browser.

In contrast, if the same web page is downloaded from the server and fed directly to BeautifulSoup, it can only be parsed as static HTML. To get the full dynamic content, the page must first be processed by a browser (or equivalent).
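
One possible "browser or equivalent" is Selenium driving a headless Firefox. The sketch below is a rough Python 3 illustration and assumes the `selenium` package and a working Firefox/geckodriver setup, none of which appear in the question; `fetch_rendered_html` and `save_text` are hypothetical helper names.

```python
from pathlib import Path


def fetch_rendered_html(url):
    """Return the page's HTML after the browser has run its scripts.

    Assumes Selenium and Firefox are installed (an assumption, not something
    the question uses). Imported lazily so this module loads without Selenium.
    """
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("--headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # page_source is the browser's serialization of the current DOM,
        # comparable to what the bookmarklet reads through innerHTML.
        return driver.page_source
    finally:
        driver.quit()


def save_text(html, path):
    """Write the HTML string to a text file as UTF-8."""
    Path(path).write_text(html, encoding="utf-8")
```

Something like `save_text(fetch_rendered_html("http://www.doctorisin.net/"), "rendered.txt")` would then save the modified DOM rather than the static source. Scripts that keep changing the page after it loads may still not be reflected, so some waiting logic could be needed.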


Source: https://habr.com/ru/post/1390637/

