Matching a string only if it is not in <script> or <a> tags

Question


I am working on a browser plugin that replaces all instances of "someString" (as defined by a complex regular expression) with <a href="http://domain.com/$1">$1</a>. This usually works fine when done as a global replace on the body's innerHTML. However, it breaks the page when it finds (and replaces) "someString" inside <script> tags (e.g. as a JS variable or other JS reference). It also breaks if "someString" is already part of an anchor.

So basically I want to replace all instances of "someString" globally, except those that fall inside <script></script> or <a></a>.

Essentially, I have:

    var body = document.getElementsByTagName('body')[0].innerHTML;
    body = body.replace(/(someString)/gi, '<a href="http://domain.com/$1">$1</a>');
    document.getElementsByTagName('body')[0].innerHTML = body;

But obviously this is not good enough. I have struggled for hours and read all the answers here (including the many adamant ones insisting that regex should not be used with HTML), so I'm open to suggestions on how to do this. I would prefer plain JS, but can use jQuery if necessary.

Edit - HTML Example :

    <body>
      someString
      <script type="text/javascript">
        var someString = 'blah';
        console.log(someString);
      </script>
      <a href="someString.html">someString</a>
    </body>

In this case, only the very first instance of "someString" should be replaced.
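To make the failure concrete, here is a small sketch that applies the same naive replacement to the sample markup held in a plain string (no DOM involved). The variable names sampleHtml and naiveResult are made up for this illustration.

```javascript
// Sample markup from the question, held as a plain string (hypothetical name).
var sampleHtml =
  '<body> someString <script type="text/javascript"> ' +
  'var someString = \'blah\'; console.log(someString); </script> ' +
  '<a href="someString.html">someString</a> </body>';

// The naive global replacement from the question.
var naiveResult = sampleHtml.replace(
  /(someString)/gi,
  '<a href="http://domain.com/$1">$1</a>'
);

// Every occurrence is rewritten, including the JS variable name,
// the href attribute value and the existing link text.
var linkCount = naiveResult.split('http://domain.com/').length - 1;
console.log(linkCount);
console.log(naiveResult);
```

The script body ends up containing markup in place of the variable name, and the existing anchor gets a link nested inside both its href and its text, which is why the page breaks.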

+4

javascript dom regex

ggutenberg Jan 11 '11 at 14:20


5 answers

Well, you can use XPath in Mozilla (assuming you're writing a plugin for Firefox): call document.evaluate. Otherwise you can use an XPath library for this (there are several out there)...

    var matches = document.evaluate(
        '//*[not(name() = "a") and not(name() = "script") and contains(., "string")]',
        document,
        null,
        XPathResult.UNORDERED_NODE_ITERATOR_TYPE,
        null
    );

Then do the replacement in a callback function:

    var callback = function(node) {
        var text = node.nodeValue;
        text = text.replace(/(someString)/gi, '<a href="http://domain.com/$1">$1</a>');
        var div = document.createElement('div');
        div.innerHTML = text;
        // div.childNodes is live: every insertBefore() removes the node from div,
        // so keep moving the first child until none remain instead of indexing.
        while (div.firstChild) {
            node.parentNode.insertBefore(div.firstChild, node);
        }
        node.parentNode.removeChild(node);
    };

    var nodes = [];
    // cache the tree since we want to modify it as we iterate
    var node = matches.iterateNext();
    while (node) {
        nodes.push(node);
        node = matches.iterateNext();
    }

    for (var key = 0, length = nodes.length; key < length; key++) {
        node = nodes[key];
        // Check for a Text node
        if (node.nodeType == Node.TEXT_NODE) {
            callback(node);
        } else {
            // Copy the live childNodes list first, because callback() modifies it
            var children = Array.prototype.slice.call(node.childNodes);
            for (var i = 0, l = children.length; i < l; i++) {
                if (children[i].nodeType == Node.TEXT_NODE) {
                    callback(children[i]);
                }
            }
        }
    }

+2

ircmaxell Jan 11 '11 at 16:44

I know you do not want to hear it, but this does not seem like a job for a regular expression. Regular expressions do not handle negative matches well without becoming complex and unreadable.

Perhaps this regular expression may be close enough, though:

 />[^<]*(someString)[^<]*</

It captures any instance of someString that sits between a > character and a < character.
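A quick check of this pattern against the question's sample markup, as a rough sketch on a plain string. The names sampleHtml, betweenTags and hits are invented here, and the g flag is added only so that all matches can be counted.

```javascript
// Sample markup from the question, held as a plain string (hypothetical name).
var sampleHtml =
  '<body> someString <script type="text/javascript"> ' +
  'var someString = \'blah\'; console.log(someString); </script> ' +
  '<a href="someString.html">someString</a> </body>';

// The pattern from this answer, with the g flag added to collect every match.
var betweenTags = />[^<]*(someString)[^<]*</g;

// With the g flag, match() returns the full matched substrings.
var hits = sampleHtml.match(betweenTags) || [];
hits.forEach(function (hit) {
  console.log(JSON.stringify(hit));
});
```

On this input the pattern skips the href attribute value, but it still matches text inside the <script> block and inside the existing <a>, because both are also text sitting between a > and a <. That is the sense in which it is only "close enough".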

+1

Jeff Jan 11 '11 at 14:45

Another idea: if you use jQuery, you can use the :contains pseudo-selector.

    $('*:contains(someString)').each(function(i) {
        var markup = $(this).html();
        // modify markup to insert anchor tag
        $(this).html(markup);
    });

This will select any DOM element containing the text "someString". I don't think it will traverse <script> tags, so you should be fine.

+1

Jeff Jan 11 '11 at 18:17


You can try the following:

 /(someString)(?![^<]*?(<\/a>|<\/script>))/

I have not tested every scenario, but it basically uses a negative lookahead to look for the next opening bracket after someString, and if that bracket is part of an anchor or script closing tag, it does not match.

Your example seems to work in this fiddle, although it certainly does not cover all the possibilities. In cases where the innerHTML of your <a></a> contains tags (for example, <b> or <span>), or the code in the script tags generates HTML (contains strings with tags in them), you would need something more complex.
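The behaviour described above can be reproduced on plain strings. The sketch below is illustrative only: sampleHtml, lookahead, replaced, nested and nestedHits are hypothetical names, and the gi flags are added so the pattern can be used for a global replace.

```javascript
// Sample markup from the question, held as a plain string (hypothetical name).
var sampleHtml =
  '<body> someString <script type="text/javascript"> ' +
  'var someString = \'blah\'; console.log(someString); </script> ' +
  '<a href="someString.html">someString</a> </body>';

// The lookahead pattern from this answer, with g and i flags added.
var lookahead = /(someString)(?![^<]*?(<\/a>|<\/script>))/gi;

// Only occurrences whose next "<" does not start </a> or </script> are replaced.
var replaced = sampleHtml.replace(
  lookahead,
  '<a href="http://domain.com/$1">$1</a>'
);
console.log(replaced);

// Limitation: inside an anchor that contains another tag, the next "<"
// starts </b>, so the lookahead does not reject the match.
var nested = '<a href="x.html"><b>someString</b> more</a>';
var nestedHits = nested.match(lookahead) || [];
console.log(nestedHits.length);
```

On the sample only the first occurrence is rewritten, while the nested case still matches inside an existing link, which illustrates the limitation mentioned in the answer.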

+1

Mike c Jan 12 '11 at 14:52


arcain · Accepted Answer · 2011-01-14T07:38:17+0000

Try this and see if it suits your needs (tested in IE 8 and Chrome).

    <script src="jquery-1.4.4.js" type="text/javascript"></script>
    <script>
      var pattern = /(someString)/gi;
      var replacement = "<a href=\"http://domain.com/$1\">$1</a>";

      $(function() {
        $("body :not(a,script)")
          .contents()
          .filter(function() {
            // nodeType 3 is a text node
            return this.nodeType == 3 && this.nodeValue.search(pattern) != -1;
          })
          .each(function() {
            var span = document.createElement("span");
            span.innerHTML = "&nbsp;" + $.trim(this.nodeValue.replace(pattern, replacement));
            this.parentNode.insertBefore(span, this);
            this.parentNode.removeChild(this);
          });
      });
    </script>

The code uses jQuery to find all text nodes in the document <body> that are not inside <a> or <script> blocks and that contain the search pattern. Once they are found, a span containing the modified content of the node is inserted, and the old text node is removed.

The only problem I ran into was that IE 8 handles text nodes containing only whitespace differently than Chrome, so a replacement sometimes loses its leading space; hence the non-breaking space inserted in front of the text containing the regular expression replacements.
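Since the question prefers plain JS, the same idea can be sketched without jQuery. This is not the answerer's code: splitForLinks and linkifyTextNodes are hypothetical helper names, and the sketch assumes a browser that supports document.createTreeWalker. It builds the links with DOM methods instead of innerHTML, so text in the page is never re-parsed as markup, and it checks every ancestor, so text inside an <a><b>...</b></a> is skipped too.

```javascript
// Hypothetical helper: split text into plain strings and { link: match } markers.
function splitForLinks(text, pattern) {
  var parts = [];
  var last = 0;
  var m;
  pattern.lastIndex = 0; // pattern must carry the g flag
  while ((m = pattern.exec(text)) !== null) {
    if (m[0] === '') { pattern.lastIndex++; continue; } // skip empty matches
    if (m.index > last) parts.push(text.slice(last, m.index));
    parts.push({ link: m[0] });
    last = m.index + m[0].length;
  }
  if (last < text.length) parts.push(text.slice(last));
  return parts;
}

// Hypothetical helper: linkify text nodes under root, skipping <a> and <script>.
function linkifyTextNodes(root, pattern) {
  var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, null);
  var targets = [];
  var node;
  // Collect first, modify afterwards, so the walk is not disturbed.
  while ((node = walker.nextNode())) {
    var skip = false;
    for (var p = node.parentNode; p && p !== root; p = p.parentNode) {
      var tag = p.nodeName.toLowerCase();
      if (tag === 'a' || tag === 'script') { skip = true; break; }
    }
    if (!skip) targets.push(node);
  }
  targets.forEach(function (textNode) {
    var parts = splitForLinks(textNode.nodeValue, pattern);
    var hasLink = parts.some(function (part) { return typeof part !== 'string'; });
    if (!hasLink) return;
    var frag = document.createDocumentFragment();
    parts.forEach(function (part) {
      if (typeof part === 'string') {
        frag.appendChild(document.createTextNode(part));
      } else {
        var a = document.createElement('a');
        a.href = 'http://domain.com/' + part.link; // URL shape taken from the question
        a.appendChild(document.createTextNode(part.link));
        frag.appendChild(a);
      }
    });
    textNode.parentNode.replaceChild(frag, textNode);
  });
}

// In a browser: linkifyTextNodes(document.body, /someString/gi);
```

Because whitespace in the surrounding text is preserved as ordinary text nodes, this variant should not need the non-breaking-space workaround, though that has not been verified in IE 8.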
