Removing JavaScript code while parsing a web page

When capturing the contents of a web page with cURL or file_get_contents, what is the easiest way to remove the embedded JavaScript code? I am thinking of a regex that removes everything between <script> tags, but regular expressions are not a reliable method for this purpose.

Is there a better way to parse an HTML page (just to remove the JavaScript code)? If regex is still the best option, what is the most reliable pattern for this?

1 answer

You can use DOMDocument and removeChild(). Something like the following should get you going.

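    <?php
    $doc = new DOMDocument;

    // Real-world HTML is often malformed; keep libxml warnings out of the output
    libxml_use_internal_errors(true);
    $doc->loadHTMLFile('index.html');

    // getElementsByTagName() returns a live list, so copy it before removing nodes
    $scripts = iterator_to_array($doc->getElementsByTagName('script'));

    foreach ($scripts as $script) {
        // A script can sit anywhere in the tree, so remove it from its own parent
        $script->parentNode->removeChild($script);
    }

    echo $doc->saveHTML();
    ?>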
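Since the question fetches the page with cURL or file_get_contents, the markup usually arrives as a string rather than a file. Below is a minimal sketch of the same DOMDocument approach for that case. The helper name strip_scripts() and the sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (name chosen for illustration): removes every
// <script> element from an HTML string and returns the remaining markup.
function strip_scripts(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup is often malformed; collect libxml warnings quietly
    // and restore the previous error-handling mode afterwards.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first so removals do not skip elements.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    return $doc->saveHTML();
}

// Inline sample markup stands in for a page fetched with cURL or
// file_get_contents.
$html = '<html><head><script>alert(1)</script></head>'
      . '<body><p>Hello</p><script src="a.js"></script></body></html>';

echo strip_scripts($html);
```

Note that this only removes <script> elements. Inline event-handler attributes such as onclick and javascript: URLs are left in place, so they would need separate handling if they matter for your use case.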

Source: https://habr.com/ru/post/1380362/

