Removing JavaScript code while parsing a web page

Question

When capturing the contents of a web page with cURL or file_get_contents(), what is the easiest way to remove the embedded JavaScript code? I am thinking of a regex that removes everything between <script> tags, but regular expressions are not a reliable method for this purpose.

Is there a better way to parse an HTML page (just removing the JavaScript code)? If regex is still the best option, what is the most reliable expression for this?

+4

html php regex parsing html-parsing

Googlebot Nov 09 '11 at 10:18

1 answer

Treffynnon · Accepted Answer · 2011-11-09T10:28:12+0000

You can use DOMDocument and removeChild(). Something like the following should get you going.

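    <?php
    $doc = new DOMDocument;

    // loadHTMLFile() tolerates real-world HTML; load() expects well-formed XML
    libxml_use_internal_errors(true);
    $doc->loadHTMLFile('index.html');

    // copy the live node list first, since removing nodes while iterating it skips elements
    $scripts = iterator_to_array($doc->getElementsByTagName('script'));

    foreach ($scripts as $script) {
        // remove each script from its own parent, wherever it sits in the tree
        $script->parentNode->removeChild($script);
    }

    echo $doc->saveHTML();
    ?>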
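
Since the question fetches the page with cURL or file_get_contents(), the markup will usually be in a string rather than a file. Below is a minimal sketch of the same DOMDocument approach for that case. The helper name strip_scripts() and the sample input are made up for illustration, and it only removes script elements, not inline event-handler attributes such as onclick.

```php
<?php
// Hypothetical helper (not from the original answer): strips every
// <script> element from an HTML string and returns the remaining markup.
function strip_scripts(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect libxml warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list; snapshot it into an
    // array so that removing nodes does not disturb the iteration.
    $scripts = iterator_to_array($doc->getElementsByTagName('script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    return $doc->saveHTML();
}

// Placeholder input; in practice this would be the string returned by
// cURL or file_get_contents().
$html = '<html><head><script>alert(1);</script></head>'
      . '<body><p>Hello</p><script src="a.js"></script></body></html>';

echo strip_scripts($html);
```

Snapshotting the node list before the loop matters here: removing nodes from a live DOMNodeList inside foreach can leave some script elements behind.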
