Removing all script tags from html with JS regex

Question

I want to strip the script tags out of this HTML on pastebin:

http://pastebin.com/mdxygM0a

I tried using the following regex

html.replace(/<script.*>.*<\/script>/ims, " ")

But it does not remove all script tags in the HTML. It only removes single-line (inline) scripts. I need a regex that can remove all script tags, both inline and multi-line. It would be highly appreciated if the answer were tested against my example: http://pastebin.com/mdxygM0a

Thanks

+57

javascript html regex

Kennedy Jul 12 '11 at 4:01

13 answers

jQuery uses a regex to remove script tags in some cases, and I'm sure its developers had a damn good reason. Probably some browsers execute scripts when they are inserted using innerHTML.

Here's the regex:

 /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi

And before people start crying "but regular expressions for HTML are evil": yes, they are, but for script tags they are safe because of a special behavior: a <script> section may not contain </script> at all unless it is meant to end at that position. That makes it easy to match with a regex. However, from a quick look, the regex above does not account for whitespace inside the closing tag, so you will have to test whether </script > and similar variants still work.
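As a rough illustration of the point above, the following sketch applies the quoted pattern to a small multi-line sample (the sample markup is made up for this example, not taken from the pastebin) and then checks the whitespace caveat on a closing tag written as `</script >`.

```javascript
// The jQuery-derived pattern quoted above.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical sample: one external script and one multi-line inline script.
const html = [
  '<p>before</p>',
  '<script src="a.js"></script>',
  '<script type="text/javascript">',
  '  if (a < b) { alert("hi"); }',
  '</script>',
  '<p>after</p>'
].join('\n');

// Both script blocks are removed; the paragraphs stay.
const stripped = html.replace(SCRIPT_RE, '');
console.log(stripped);

// A closing tag with trailing whitespace is NOT matched by this pattern,
// so the string comes back unchanged.
const spaced = '<script>alert(1)</script >';
const spacedResult = spaced.replace(SCRIPT_RE, '');
console.log(spacedResult === spaced);
```

The pattern never uses `.`, so it does not depend on whether `.` matches newlines, which is why multi-line scripts are handled.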

+93

ThiefMaster Jul 12 '11 at 6:29

Regexes can be beaten, but if you have a string version of HTML that you don't want to inject into a DOM, they may be the best approach. You may want to run the replacement in a loop to handle something like:

 <scr<script>Ha!</script>ipt> alert(document.cookie);</script>

Here is what I did using the jquery regex above:

 var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
 while (SCRIPT_REGEX.test(text)) {
     text = text.replace(SCRIPT_REGEX, "");
 }
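To see why a single pass is not enough, here is a small sketch run against the nested example above. It is a variation on the loop, not the answer's exact code: the helper name `stripScriptsUntilStable` is made up for illustration, and it repeats the replacement until the string stops changing instead of calling `test()` on a global regex.

```javascript
const SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: keep replacing until a pass makes no further change.
function stripScriptsUntilStable(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(SCRIPT_REGEX, '');
  } while (text !== previous);
  return text;
}

const nested = '<scr<script>Ha!</script>ipt> alert(document.cookie);</script>';

// One pass removes only the inner tag, which reassembles a new script tag.
const onePass = nested.replace(SCRIPT_REGEX, '');
console.log(onePass);

// Repeating until stable removes the reassembled tag as well.
const stable = stripScriptsUntilStable(nested);
console.log(JSON.stringify(stable));
```

Comparing the string before and after each pass avoids depending on the `lastIndex` state that a global regex carries between `test()` calls.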

+38

Conrad Damon Mar 28 2018-12-12T00:

This regex should work too:

 <script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>

It even allows you to have “problematic” string variables like these inside:

 <script type="text/javascript">
 var test1 = "</script>";
 var test2 = '\'</script>';
 var test1 = "\"</script>";
 var test1 = "<script>\"";
 var test2 = '<scr\'ipt>';
 /* </script> */
 // </script>
 /* ' */
 // var foo=" '
 </script>

I think the jQuery and Prototype patterns fail on these.

Edit July 31 '17: Added a) non-capturing groups for better performance (and no empty groups) and b) support for JavaScript comments.

+10

spaark Aug 05 '13 at 7:15

Whenever you have to resort to regex-based script tag cleanup, at least allow whitespace in the closing tag, in the form of

 </script\s*>

Otherwise, things like

 <script>alert(666)</script >

will remain, because trailing spaces after the tag name are valid.
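A minimal sketch of that adjustment, assuming it is applied to the jQuery-derived pattern quoted in another answer on this page (the combination is an illustration, not something either answer states):

```javascript
// jQuery-derived pattern with \s* added before the closing '>' in both the
// lookahead and the final closing-tag match.
const SCRIPT_WS_RE = /<script\b[^<]*(?:(?!<\/script\s*>)<[^<]*)*<\/script\s*>/gi;

const samples = [
  '<script>alert(666)</script >',
  '<script>alert(666)</script\n>',
  '<script>alert(666)</script>'
];

// Each sample is reduced to an empty string.
const results = samples.map(function (s) {
  return s.replace(SCRIPT_WS_RE, '');
});
console.log(results);
```

The lookahead has to be relaxed together with the final closing-tag match; otherwise the group would consume `</script >` and the overall match would fail.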

+10

neongrau Apr 27 '15 at 8:15

Why not use jQuery.parseHTML()? http://api.jquery.com/jquery.parsehtml/

+4

shao Feb 06 '14 at 23:23

In my case, I needed to parse out the page title AND keep all the other goodness of jQuery, minus having it fire scripts. Here is my solution, which seems to work.

 $.get('/somepage.htm', function (data) {
   // excluded code to extract title for simplicity
   var bodySI = data.indexOf('<body>') + '<body>'.length,
       bodyEI = data.indexOf('</body>'),
       body = data.substr(bodySI, bodyEI - bodySI),
       $body;

   body = body.replace(/<script[^>]*>/gi, ' <!-- ');
   body = body.replace(/<\/script>/gi, ' --> ');

   //console.log(body);

   $body = $('<div>').html(body);
   console.log($body.html());
 });

This sidesteps most of the worries about scripts, because you are not trying to remove the script tags and their content. Instead you replace the tags with comment delimiters, which renders schemes for breaking the filter useless, since your script declarations end up delimited by comments.
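The string-level part of this trick can be tried without jQuery. The sketch below uses a hypothetical helper name, `commentOutScripts`, and a made-up input. One caveat worth checking for your own data: an HTML comment ends at the first `-->`, so a script body that itself contains `-->` would close the comment early.

```javascript
// Hypothetical helper wrapping the two replacements from the snippet above.
function commentOutScripts(body) {
  return body
    .replace(/<script[^>]*>/gi, ' <!-- ')
    .replace(/<\/script>/gi, ' --> ');
}

// The script body ends up inside an HTML comment instead of a script element.
const commented = commentOutScripts('<p>a</p><script>alert(1)</script><p>b</p>');
console.log(commented);
```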

Let me know if this is still a problem, as it will help me too.

+1

Jason Sebring Oct 03

Here are a few shell one-liners that you can use to strip out or normalize various elements.

 # doctype
 find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<\!DOCTYPE\s\+html[^>]*>/<\!DOCTYPE html>/gi" {} \;

 # meta charset
 find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<meta[^>]*content=[\"'][^\"']*utf-8[\"'][^>]*>/<meta charset=\"utf-8\">/gi" {} \;

 # script text/javascript
 find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<script[^>]*\)\(\stype=[\"']text\/javascript[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

 # style text/css
 find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<style[^>]*\)\(\stype=[\"']text\/css[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

 # html xmlns
 find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxmlns=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

 # html xml:lang
 find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxml:lang=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

0

davidcondrey Mar 25 '14 at 8:30

/: </ s \ w * / g; ((</ s \ w) <[^ & l] ?!) - Deletes any sequence in any combination with

0

Blackening Apr 08 '14 at 7:17

If you want to remove all JavaScript code from some HTML text, removing <script> tags is not enough, because JavaScript can still live in onclick, onerror, href and other attributes.

Try this npm module that handles all this: https://www.npmjs.com/package/strip-js
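A quick way to convince yourself of this: the sketch below strips script tags with one of the regexes from this page and then checks what is left. The `INLINE_JS_RE` detector is a rough illustration written for this example only; it is not a sanitizer and not how strip-js works.

```javascript
// Script-tag pattern from this page, with whitespace allowed in the closing tag.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script\s*>)<[^<]*)*<\/script\s*>/gi;

// Rough, illustrative detector for inline event handlers and javascript: URLs.
const INLINE_JS_RE = /\son\w+\s*=|href\s*=\s*["']?\s*javascript:/i;

const input =
  '<script>alert(1)</script>' +
  '<img src="x" onerror="alert(2)">' +
  '<a href="javascript:alert(3)">link</a>';

// The script element is gone, but the handler attribute and the
// javascript: URL are still present in the output.
const withoutScriptTags = input.replace(SCRIPT_RE, '');
console.log(withoutScriptTags);

const stillHasInlineJs = INLINE_JS_RE.test(withoutScriptTags);
console.log(stillHasInlineJs);
```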

0

Shivanshu Goyal Oct 09 '16 at 21:05

You can try

 $("#your_div_id").remove();

or

 $("#your_div_id").html("");

0

Pooja Roy Nov 16 '16 at 10:12

Try the following:

 var text = text.replace(/<script[^>]*>(?:(?!<\/script>)[^])*<\/script>/g, "")

0

surinder singh Mar 09 '17 at 10:59

This modified version works very well:

 /<\s*script\b[^<]*(?:(?!<\/script\s*>)<[^<]*)*<\s*\/\s*script\s*>/gi

0

Paul W 02 Dec '18 at 23:32

RobG · Accepted Answer · 2011-07-12 06:09

Trying to remove HTML markup using a regular expression is problematic. You don't know what is in there as script content or attribute values. One way is to insert the string as the innerHTML of a div, remove any script elements and return the innerHTML, e.g.

 function stripScripts(s) {
   var div = document.createElement('div');
   div.innerHTML = s;
   var scripts = div.getElementsByTagName('script');
   var i = scripts.length;
   while (i--) {
     scripts[i].parentNode.removeChild(scripts[i]);
   }
   return div.innerHTML;
 }

 alert(
   stripScripts('<span><script type="text/javascript">alert(\'foo\');<\/script><\/span>')
 );

Note that at present, browsers will not execute a script that is inserted using the innerHTML property, and likely never will, especially as the element is not added to the document.
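One caveat, as far as I know: assigning innerHTML on a detached div does not run script elements, but other markup in the string, such as an img with an onerror attribute, can still fire its handler because the image starts loading even though the div is not in the document. A possible variation, sketched below under that assumption, parses the string with DOMParser so that the markup lives in a separate, inert document. The function name `stripScriptsInert` is made up for illustration, and the code only runs in a browser; DOMParser is not defined in plain Node.js.

```javascript
// Sketch: parse into a separate document, remove script elements,
// and serialize the body back to a string.
function stripScriptsInert(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  doc.querySelectorAll('script').forEach(function (el) {
    el.remove();
  });
  return doc.body.innerHTML;
}

// Browser-only usage; skipped where DOMParser is unavailable.
if (typeof DOMParser !== 'undefined') {
  console.log(stripScriptsInert('<span><script>alert("foo");<\/script></span>'));
}
```

Like the function above, this removes only script elements. Event-handler attributes and javascript: URLs remain in the returned string, and only the body is serialized, so anything the parser places in the head is dropped.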
