How to wrap part of text in node using JavaScript

I have a challenging problem. I'm working on a script that takes a regex as input. The script finds all matches of this regex in the document and wraps each match in its own <span> element. The hard part is that the text is a formatted HTML document, so my script needs to navigate the DOM and apply the regex across multiple text nodes at once, figuring out where it has to split text nodes if needed.

For example, with a regular expression that captures complete sentences starting with a capital letter and ending with a period, this document:

<p> <b>HTML</b> is a language used to make <b>websites.</b> It was developed by <i>CERN</i> employees in the early 90s. </p>

Will be turned into this:

 <p> <span><b>HTML</b> is a language used to make <b>websites.</b></span> <span>It was developed by <i>CERN</i> employees in the early 90s.</span> </p>

The script then returns a list of all the spans it created.

I already have some code that finds all the text nodes and stores them in a list, along with their position across the whole document and their depth. You don't really need to understand that code to help me, and its recursive structure can be a bit confusing. The first part I'm not sure how to do is figuring out which elements should be included within the span.

    function SmartNode(node, depth, start) {
        this.node = node;
        this.depth = depth;
        this.start = start;
    }

    // Collects the text nodes under `node` with their depth and text offset
    function findTextNodes(node, depth, start) {
        var list = [];
        start = start || 0;
        depth = (typeof depth !== "undefined" ? depth : -1);

        if (node.nodeType === Node.TEXT_NODE) {
            list.push(new SmartNode(node, depth, start));
        } else {
            for (var i = 0; i < node.childNodes.length; ++i) {
                list = list.concat(findTextNodes(node.childNodes[i], depth + 1, start));
                if (list.length) {
                    // Continue right after the last text node found so far
                    var last = list[list.length - 1];
                    start = last.start + last.node.nodeValue.length;
                }
            }
        }

        return list;
    }

My plan is to build a string from the entire document, run the regex over it, use the list to work out which nodes each match falls in, and then split the text nodes accordingly.
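As a rough illustration of the "run the regex over the whole string" step, here is a minimal sketch that collects each match as a [start, end) interval. The helper name collectMatches is hypothetical and not part of the script above, and the sample string is just the text content of the first example written out by hand.

```javascript
// Collect a { start, end } interval for every match of `regex` in `text`.
// `collectMatches` is a hypothetical helper name used only for this sketch.
function collectMatches(text, regex) {
  // Work on a global copy so lastIndex advances and the caller's regex is untouched.
  var flags = regex.flags.indexOf("g") === -1 ? regex.flags + "g" : regex.flags;
  var re = new RegExp(regex.source, flags);
  var matches = [], m;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches
    matches.push({ start: m.index, end: m.index + m[0].length });
  }
  return matches;
}

// Text as it would look after concatenating the text nodes of the first example.
var text = "HTML is a language used to make websites. " +
           "It was developed by CERN employees in the early 90s.";
var sentences = collectMatches(text, /[A-Z].*?\./);
```

The intervals are plain offsets into the concatenated string, so they can later be compared against the start positions stored with each text node.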

But the problem arises when I have a document like this:

 <p> This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a> </p> 

There is a sentence that starts outside the <a> tag but ends inside it. I don't want the script to split that link into two tags; in a more complex document, that could break the page. The code could either wrap the two sentences together:

 <p> <span>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></span> </p> 

Or just wrap each piece in its own element:

 <p> <span>This program is </span> <a href="beta.html"> <span>not stable yet.</span> <span>Do not use this in production yet.</span> </a> </p> 

There could be a parameter to specify which of the two it should do. I'm just not sure how to detect when an impossible cut is about to happen, and how to recover from it.

Another problem arises when there is whitespace inside a child element, like this:
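One possible way to detect such a cut, sketched under the assumption that every element has been recorded as a { start, end } range of text offsets (a hypothetical format, not something the code above produces): a match can be wrapped in a single span only if it does not partially overlap any element range. The first recovery option (wrapping the sentences together) then amounts to growing the match until nothing is cut.

```javascript
// `elements` holds hypothetical { start, end } text-offset ranges, one per element.
// Returns the elements that the match would cut through (partial overlap only).
function findSplitElements(match, elements) {
  return elements.filter(function (el) {
    var cutsStart = match.start < el.start && el.start < match.end && match.end < el.end;
    var cutsEnd = el.start < match.start && match.start < el.end && el.end < match.end;
    return cutsStart || cutsEnd;
  });
}

// Recovery by expansion: grow the match until it no longer cuts through any element.
function expandMatch(match, elements) {
  var result = { start: match.start, end: match.end };
  var split;
  while ((split = findSplitElements(result, elements)).length) {
    split.forEach(function (el) {
      result.start = Math.min(result.start, el.start);
      result.end = Math.max(result.end, el.end);
    });
  }
  return result;
}

// The link example, with whitespace-only text nodes left out for brevity.
var prefix = "This program is ";
var linkText = "not stable yet. Do not use this in production yet.";
var docText = prefix + linkText;
var link = { start: prefix.length, end: docText.length };
var first = { start: 0, end: docText.indexOf(".") + 1 };
var second = { start: first.end + 1, end: docText.length };
var expanded = expandMatch(first, [link]);
```

Here the first sentence cuts through the link, so it expands to cover the whole paragraph text, while the second sentence lies entirely inside the link and needs no change. The other recovery option, wrapping each piece separately, would split the match at the element boundaries instead of expanding it.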

 <p>This is a <b>sentence. </b></p> 

Technically, the regex match would end right after the period, before the end of the <b> tag. However, it would be much better to treat the space as part of the match and wrap it like this:

 <p><span>This is a <b>sentence. </b></span></p> 

Rather than this:

 <p><span>This is a </span><b><span>sentence.</span> </b></p> 

But this is a minor issue. After all, I could just let the regex match the extra whitespace.
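A minimal sketch of that last idea: derive a second regex that also consumes the whitespace following each match. The helper name withTrailingWhitespace is hypothetical.

```javascript
// Build a copy of `regex` that also consumes whitespace following each match.
// `withTrailingWhitespace` is a hypothetical helper name.
function withTrailingWhitespace(regex) {
  // The non-capturing group keeps top-level alternations (a|b) intact
  // and does not change the numbering of existing capture groups.
  return new RegExp("(?:" + regex.source + ")\\s*", regex.flags);
}

var sentence = withTrailingWhitespace(/[A-Z].*?\./g);
var found = "This is a sentence. ".match(sentence);
```

With this, the match for the example above includes the trailing space, so it ends exactly at the end of the <b> element's text.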

I know this may sound like a "do it for me" question, and it is not the kind of quick question we see on SO on a daily basis, but I have been stuck on this for a while, and it is for an open-source library I'm working on. Solving this problem is the last hurdle. If you think another SE site is better suited for this question, please redirect me.

+42
javascript html algorithm regex
Jul 07 '15 at 17:26
4 answers

Here are two ways to handle this.

I don't know if the following will suit your case exactly. It's a fairly simple solution to the problem, but at least it doesn't use regex to manipulate HTML tags. It matches the pattern against the raw text and then uses the DOM to manipulate the content.




First approach

This approach creates only one <span> for each match, using some of the less common browser APIs.
(Watch out for the main problem of this approach, described below the demo. If unsure, use the second approach.)

The Range class represents a text fragment. It has a surroundContents function that lets you wrap a range in an element. However, it comes with a caveat:

This method is nearly equivalent to newNode.appendChild(range.extractContents()); range.insertNode(newNode). After surrounding, the boundary points of the range include newNode.

An exception will be thrown, however, if the Range splits a non-Text node with only one of its boundary points. That is, unlike the alternative above, if there are partially selected nodes, they will not be cloned and instead the operation will fail.

Well, a workaround is provided in MDN, so all is well.

So here is the algorithm:

  • Make a list of the Text nodes and store their start indices in the text
  • Concatenate the values of these nodes to get the text
  • Find matches in the text, and for each match:

    • Find the start and end nodes of the match by comparing the nodes' start indices with the match position
    • Create a Range over the match
    • Let the browser do the dirty work using the trick above
    • Rebuild the node list, since the last action changed the DOM
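The "find the start and end nodes" step can be sketched on its own, without a DOM, by treating each text node as a plain { start, length } entry. The function name locate and the returned field names are hypothetical; the offsets it returns are the values that would be passed to range.setStart and range.setEnd.

```javascript
// `nodes` is a list of { start, length } entries in document order
// (plain objects here so the lookup can be shown without a DOM).
function locate(nodes, matchStart, matchEnd) {
  var startIndex = -1, endIndex = -1;
  for (var i = 0; i < nodes.length; ++i) {
    var nodeEnd = nodes[i].start + nodes[i].length;
    if (nodeEnd <= matchStart) continue;               // node lies entirely before the match
    if (startIndex === -1) startIndex = i;             // first node reaching into the match
    if (nodeEnd >= matchEnd) { endIndex = i; break; }  // node holding the last matched character
  }
  if (startIndex === -1 || endIndex === -1) return null;
  return {
    startIndex: startIndex,
    startOffset: matchStart - nodes[startIndex].start,
    endIndex: endIndex,
    endOffset: matchEnd - nodes[endIndex].start
  };
}

// Text nodes of the first example paragraph, up to the start of the second sentence.
var pieces = ["HTML", " is a language used to make ", "websites.", " It was developed by "];
var entries = [], offset = 0;
pieces.forEach(function (piece) {
  entries.push({ start: offset, length: piece.length });
  offset += piece.length;
});

// The first sentence runs from offset 0 to the end of the third text node.
var firstSentenceEnd = entries[2].start + entries[2].length;
var located = locate(entries, 0, firstSentenceEnd);
```

For the first sentence this yields the first text node at offset 0 as the start and the third text node at the offset just past its last character as the end.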

Here is my implementation with a demo:

    function highlight(element, regex) {
        var document = element.ownerDocument;

        // Collect the text nodes under `element` with their offsets in the combined text
        var getNodes = function() {
            var nodes = [],
                offset = 0,
                node,
                nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);

            while (node = nodeIterator.nextNode()) {
                nodes.push({
                    textNode: node,
                    start: offset,
                    length: node.nodeValue.length
                });
                offset += node.nodeValue.length;
            }
            return nodes;
        };

        var nodes = getNodes();
        if (!nodes.length)
            return;

        var text = "";
        for (var i = 0; i < nodes.length; ++i)
            text += nodes[i].textNode.nodeValue;

        var match;
        while (match = regex.exec(text)) {
            // Prevent empty matches causing infinite loops
            if (!match[0].length) {
                regex.lastIndex++;
                continue;
            }

            // Find the start and end text node
            var startNode = null, endNode = null;
            for (i = 0; i < nodes.length; ++i) {
                var node = nodes[i];

                if (node.start + node.length <= match.index)
                    continue;

                if (!startNode)
                    startNode = node;

                if (node.start + node.length >= match.index + match[0].length) {
                    endNode = node;
                    break;
                }
            }

            var range = document.createRange();
            range.setStart(startNode.textNode, match.index - startNode.start);
            range.setEnd(endNode.textNode, match.index + match[0].length - endNode.start);

            var spanNode = document.createElement("span");
            spanNode.className = "highlight";
            spanNode.appendChild(range.extractContents());
            range.insertNode(spanNode);

            // The DOM changed, so rebuild the node list
            nodes = getNodes();
        }
    }

    // Test code
    var testDiv = document.getElementById("test-cases");
    var originalHtml = testDiv.innerHTML;

    function test() {
        testDiv.innerHTML = originalHtml;
        try {
            var regex = new RegExp(document.getElementById("regex").value, "g");
            highlight(testDiv, regex);
        } catch(e) {
            testDiv.innerText = e;
        }
    }

    document.getElementById("runBtn").onclick = test;
    test();

    /* CSS */
    .highlight {
        background-color: yellow;
        border: 1px solid orange;
        border-radius: 5px;
    }

    .section {
        border: 1px solid gray;
        padding: 10px;
        margin: 10px;
    }

    <!-- HTML -->
    <form class="section">
        RegEx: <input id="regex" type="text" value="[A-Z].*?\." />
        <button id="runBtn">Highlight</button>
    </form>

    <div id="test-cases" class="section">
        <div>foo bar baz</div>
        <p>
            <b>HTML</b> is a language used to make <b>websites.</b>
            It was developed by <i>CERN</i> employees in the early 90s.
        </p>
        <p>
            This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
        </p>
        <div>foo bar baz</div>
    </div>

Well, that was the lazy approach, which unfortunately doesn't work in some cases. It works well if you only highlight across inline elements, but it breaks when there are block elements along the way, because of the following property of the extractContents function:

Partially selected nodes are cloned to include the parent tags necessary to make the document fragment valid.

This is bad. It will simply duplicate block-level nodes. Try the previous demo with the baz\s+HTML regex if you want to see how it breaks.




Second approach

This approach iterates over the matching text nodes, creating <span> tags along the way.

The overall algorithm is straightforward, since it simply wraps each matching text node in its own <span>. But this means we have to deal with partially matching text nodes, which requires some more effort.

If a text node matches only partially, it is split with the splitText function:

After splitting, the current node contains all the content up to the specified offset point, and a newly created node of the same type contains the remaining text. The created node is returned to the caller.
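To make the splitting step easier to follow, here is a DOM-free sketch that computes, for one match, which slice of each text node would end up in a <span>. Text nodes are again plain { start, length } entries, and the name slicesForMatch is hypothetical. A slice whose from is greater than 0, or whose to is smaller than the node length, corresponds to a node that would need splitText.

```javascript
// For one match, list the part of every text node that falls inside it.
// Each result is { index, from, to } with offsets relative to that node.
function slicesForMatch(nodes, matchStart, matchEnd) {
  var slices = [];
  for (var i = 0; i < nodes.length; ++i) {
    var nodeStart = nodes[i].start;
    var nodeEnd = nodeStart + nodes[i].length;
    if (nodeEnd <= matchStart) continue; // node lies before the match
    if (nodeStart >= matchEnd) break;    // node lies after the match
    var from = Math.max(matchStart, nodeStart) - nodeStart;
    var to = Math.min(matchEnd, nodeEnd) - nodeStart;
    if (from < to) slices.push({ index: i, from: from, to: to });
  }
  return slices;
}

// The link example: one text node before the <a>, one inside it.
var outside = "This program is ";
var inside = "not stable yet. Do not use this in production yet.";
var textNodes = [
  { start: 0, length: outside.length },
  { start: outside.length, length: inside.length }
];

// First sentence: from offset 0 up to and including the first period.
var sentenceEnd = (outside + inside).indexOf(".") + 1;
var slices = slicesForMatch(textNodes, 0, sentenceEnd);
```

For the first sentence this gives the whole first node plus the leading part of the node inside the link, so only the second node has to be split.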

    function highlight(element, regex) {
        var document = element.ownerDocument;

        var nodes = [],
            text = "",
            node,
            nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);

        while (node = nodeIterator.nextNode()) {
            nodes.push({
                textNode: node,
                start: text.length
            });
            text += node.nodeValue;
        }

        if (!nodes.length)
            return;

        var match;
        while (match = regex.exec(text)) {
            var matchLength = match[0].length;

            // Prevent empty matches causing infinite loops
            if (!matchLength) {
                regex.lastIndex++;
                continue;
            }

            for (var i = 0; i < nodes.length; ++i) {
                node = nodes[i];
                var nodeLength = node.textNode.nodeValue.length;

                // Skip nodes before the match
                if (node.start + nodeLength <= match.index)
                    continue;

                // Break after the match
                if (node.start >= match.index + matchLength)
                    break;

                // Split the start node if required
                if (node.start < match.index) {
                    nodes.splice(i + 1, 0, {
                        textNode: node.textNode.splitText(match.index - node.start),
                        start: match.index
                    });
                    continue;
                }

                // Split the end node if required
                if (node.start + nodeLength > match.index + matchLength) {
                    nodes.splice(i + 1, 0, {
                        textNode: node.textNode.splitText(match.index + matchLength - node.start),
                        start: match.index + matchLength
                    });
                }

                // Highlight the current node
                var spanNode = document.createElement("span");
                spanNode.className = "highlight";

                node.textNode.parentNode.replaceChild(spanNode, node.textNode);
                spanNode.appendChild(node.textNode);
            }
        }
    }

    // Test code
    var testDiv = document.getElementById("test-cases");
    var originalHtml = testDiv.innerHTML;

    function test() {
        testDiv.innerHTML = originalHtml;
        try {
            var regex = new RegExp(document.getElementById("regex").value, "g");
            highlight(testDiv, regex);
        } catch(e) {
            testDiv.innerText = e;
        }
    }

    document.getElementById("runBtn").onclick = test;
    test();

    /* CSS */
    .highlight {
        background-color: yellow;
    }

    .section {
        border: 1px solid gray;
        padding: 10px;
        margin: 10px;
    }

    <!-- HTML -->
    <form class="section">
        RegEx: <input id="regex" type="text" value="[A-Z].*?\." />
        <button id="runBtn">Highlight</button>
    </form>

    <div id="test-cases" class="section">
        <div>foo bar baz</div>
        <p>
            <b>HTML</b> is a language used to make <b>websites.</b>
            It was developed by <i>CERN</i> employees in the early 90s.
        </p>
        <p>
            This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
        </p>
        <div>foo bar baz</div>
    </div>

This should be good enough for most cases, I hope. If you need to minimize the number of <span> tags, that can be done by extending this function, but I wanted to keep it simple for now.

+27
Jul 12 '15 at 16:25

As everyone has already said, this is more of an academic question, since this isn't really how you should be doing it. That being said, it seemed like fun, so here's one approach.

EDIT: I think I get the gist of it now.

    function myReplace(str) {
        var myRegexp = /((^<[^>*]>)+|([^<>\.]*|(<[^\/>]*>[^<>\.]+<\/[^>]*>)+)*[^<>\.]*\.\s*|<[^>]*>|[^\.<>]+\.*\s*)/g;
        var arr = str.match(myRegexp);
        var out = "";
        for (var i in arr) {
            var node = arr[i];
            if (node.indexOf("<") === 0) out += node;
            else out += "<span>" + node + "</span>"; // Here is where you would run whichever
                                                     // regex you want to match by
        }
        document.write(out.replace(/</g, "&lt;").replace(/>/g, "&gt;") + "<br>");
        console.log(out);
    }

    myReplace('<p>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></p>');
    myReplace('<p>This is a <b>sentence. </b></p>');
    myReplace('<p>This is a <b>another</b> and <i>more complex</i> even <b>super complex</b> sentence.</p>');
    myReplace('<p>This is a <b>a sentence</b>. Followed <i>by</i> another one.</p>');
    myReplace('<p>This is a <b>an even</b> more <i>complex sentence. </i></p>');

    /* Will output:
    <p><span>This program is </span><a href="beta.html"><span>not stable yet. </span><span>Do not use this in production yet.</span></a></p>
    <p><span>This is a </span><b><span>sentence. </span></b></p>
    <p><span>This is a <b>another</b> and <i>more complex</i> even <b>super complex</b> sentence.</span></p>
    <p><span>This is a <b>a sentence</b>. </span><span>Followed <i>by</i> another one.</span></p>
    <p><span>This is a </span><b><span>an even</span></b><span> more </span><i><span>complex sentence. </span></i></p>
    */
+5
Jul 12 '15 at 1:19

I would use a "flat DOM" representation for such a task.

In a flat DOM, this paragraph

 <p>abc <a href="beta.html">def. ghij.</a></p>

will be represented by two vectors:

    chars: "abc def. ghij.",
    props:  ....aaaaaaaaaa,

You then run a normal regex over chars and mark the span areas in the props vector:

    chars: "abc def. ghij."
    props:  ssssaaaaaaaaaa
                ssss sssss

I am using a schematic notation here; the real structure is an array of arrays:

 props: [ [s], [s], [s], [s], [a,s], [a,s], ... ] 
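As a rough sketch of the marking step, assuming props is an array holding one array of tag names per character (the function name markMatches and the lowercase sentence regex are made up for this example):

```javascript
// Add `tag` to the props of every character covered by a match of `regex`.
// `markMatches` is a hypothetical helper; `props` has one array per character.
function markMatches(chars, props, regex, tag) {
  var flags = regex.flags.indexOf("g") === -1 ? regex.flags + "g" : regex.flags;
  var re = new RegExp(regex.source, flags);
  var marked = props.map(function (p) { return p.slice(); }); // copy, leave input untouched
  var m;
  while ((m = re.exec(chars)) !== null) {
    if (m[0].length === 0) { re.lastIndex++; continue; } // skip empty matches
    for (var i = m.index; i < m.index + m[0].length; ++i) marked[i].push(tag);
  }
  return marked;
}

var chars = "abc def. ghij.";
// The first four characters are plain text, the rest is inside the <a>.
var props = chars.split("").map(function (c, i) { return i < 4 ? [] : ["a"]; });
var marked = markMatches(chars, props, /[a-z].*?\./, "s");
```

The regex matches "abc def." and "ghij.", so the characters of "abc " end up with [s], those of "def." and "ghij." with [a,s], and the space between the two sentences keeps just [a].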

Converting between the tree DOM and the flat DOM in either direction can be done with simple state machines.

At the end, you convert the flat DOM back into a tree DOM, which will look like this:

 <p><s>abc </s><a href="beta.html"><s>def.</s> <s>ghij.</s></a></p>
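The conversion back to a tree could look roughly like the following state machine, which keeps a stack of open tags and closes or opens tags whenever the props change from one character to the next. This is only a sketch under simplifying assumptions: tags carry no attributes, the text is not HTML-escaped, and the name flatToMarkup is hypothetical.

```javascript
// Serialize (chars, props) back to markup. Tags are written without attributes to
// keep the sketch short; a real version would carry them alongside the tag names.
function flatToMarkup(chars, props) {
  var open = []; // tags currently open, outermost first
  var out = "";

  function syncTo(target) {
    // Keep the shared prefix of open tags, close the rest, then open what is missing.
    var keep = 0;
    while (keep < open.length && keep < target.length && open[keep] === target[keep]) keep++;
    while (open.length > keep) out += "</" + open.pop() + ">";
    for (var i = keep; i < target.length; ++i) {
      out += "<" + target[i] + ">";
      open.push(target[i]);
    }
  }

  for (var i = 0; i < chars.length; ++i) {
    syncTo(props[i]);
    out += chars[i];
  }
  syncTo([]); // close whatever is still open
  return out;
}

var flatChars = "abc def. ghij.";
var flatProps = [
  ["s"], ["s"], ["s"], ["s"],                                  // "abc "
  ["a", "s"], ["a", "s"], ["a", "s"], ["a", "s"],              // "def."
  ["a"],                                                       // " "
  ["a", "s"], ["a", "s"], ["a", "s"], ["a", "s"], ["a", "s"]   // "ghij."
];
var markup = flatToMarkup(flatChars, flatProps);
```

For this input the result is <s>abc </s><a><s>def.</s> <s>ghij.</s></a>. Note that two matches that touch each other under the same tags would be merged into one element here; giving each match its own identifier in props would be one way to keep them apart.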

Just in case: I use this approach in my WYSIWYG HTML editors.

+4
Jul 11 '15 at 22:58

    function parseText( element ){
      var stack = [ element ];
      var group = false;
      var re = /(?!\s|$).*?(\.|$)/;
      while ( stack.length > 0 ){
        var node = stack.shift();
        if ( node.nodeType === Node.TEXT_NODE )
        {
          if ( node.textContent.trim() != "" )
          {
            var match;
            while( node && (match = re.exec( node.textContent )) )
            {
              var start  = group ? 0 : match.index;
              var length = match[0].length + match.index - start;
              if ( start > 0 )
              {
                node = node.splitText( start );
              }
              var wrapper = document.createElement( 'span' );
              var next    = null;
              if ( match[1].length > 0 ){
                if ( node.textContent.length > length )
                  next = node.splitText( length );
                group = false;
                wrapper.className = "sentence sentence-end";
              }
              else
              {
                wrapper.className = "sentence";
                group = true;
              }
              var parent  = node.parentNode;
              var sibling = node.nextSibling;
              wrapper.appendChild( node );
              if ( sibling )
                parent.insertBefore( wrapper, sibling );
              else
                parent.appendChild( wrapper );
              node = next;
            }
          }
        }
        else if ( node.nodeType === Node.ELEMENT_NODE || node.nodeType === Node.DOCUMENT_NODE )
        {
          stack.unshift.apply( stack, node.childNodes );
        }
      }
    }

    parseText( document.body );

    /* CSS */
    .sentence {
      text-decoration: underline wavy red;
    }

    .sentence-end {
      border-right: 1px solid red;
    }

    <!-- HTML -->
    <p>This is a sentence. This is another sentence.</p>
    <p>This sentence has <strong>emphasis</strong> inside it.</p>
    <p><span>This sentence spans</span><span> two elements.</span></p>
+4
Jul 12 '15 at 0:42


