I have a difficult problem. I am working on a script that takes a regex as input. The script then finds all the matches for this regex in the document and wraps each match in its own <span> element. The tough part is that the text is a formatted HTML document, so my script has to navigate the DOM and apply the regex across multiple text nodes at once, working out where it needs to split text nodes, if necessary.
For example, with a regular expression that captures complete sentences starting with a capital letter and ending with a period, this document:
<p> <b>HTML</b> is a language used to make <b>websites.</b> It was developed by <i>CERN</i> employees in the early 90s. </p>
Will be turned into this:
<p> <span><b>HTML</b> is a language used to make <b>websites.</b></span> <span>It was developed by <i>CERN</i> employees in the early 90s.</span> </p>
The script then returns a list of all the created spans.
I already have some code that finds all the text nodes and stores them in a list along with their position across the whole document and their depth. You don't really need to understand that code to help me, and its recursive structure can be a bit confusing. The first part I'm not sure how to do is figuring out which elements should be included within the span.
    // One text node, its depth in the tree, and the offset of its first
    // character in the concatenated text of the document.
    function SmartNode(node, depth, start) {
        this.node = node;
        this.depth = depth;
        this.start = start;
    }

    function findTextNodes(node, depth, start) {
        var list = [];
        start = start || 0;
        depth = (typeof depth !== "undefined" ? depth : -1);

        if (node.nodeType === Node.TEXT_NODE) {
            list.push(new SmartNode(node, depth, start));
        } else {
            for (var i = 0; i < node.childNodes.length; ++i) {
                list = list.concat(findTextNodes(node.childNodes[i], depth + 1, start));
                if (list.length) {
                    // Continue right after the last text node collected so far.
                    var last = list[list.length - 1];
                    start = last.start + last.node.nodeValue.length;
                }
            }
        }
        return list;
    }
I figure I'll need to build a single string out of the entire document, run the regex over it, use the list to find which nodes belong to which regex match, and then split the text nodes accordingly.
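A rough sketch of that mapping step, to make the idea concrete. It assumes each list entry has the shape of `SmartNode` above and that `start` really is the offset of the node's text in the concatenated string. `locateMatches` and the fields it returns are names made up for this illustration, and it only computes offsets without touching the DOM:

```javascript
// Hypothetical helper: maps every regex match onto the text nodes it covers.
// `smartNodes` is a list of { node, start } records in document order.
function locateMatches(smartNodes, regex) {
    // Build the full text of the document from the text nodes.
    var text = smartNodes.map(function (n) { return n.node.nodeValue; }).join("");

    // exec() only iterates over all matches with the global flag set.
    var flags = regex.flags.indexOf("g") === -1 ? regex.flags + "g" : regex.flags;
    var global = new RegExp(regex.source, flags);

    var results = [];
    var m;
    while ((m = global.exec(text)) !== null) {
        if (m[0].length === 0) { global.lastIndex++; continue; } // skip empty matches
        var from = m.index;
        var to = m.index + m[0].length;
        var pieces = [];
        smartNodes.forEach(function (n, i) {
            // Intersection of the match [from, to) with this node's text.
            var lo = Math.max(from, n.start);
            var hi = Math.min(to, n.start + n.node.nodeValue.length);
            if (lo < hi) {
                pieces.push({ nodeIndex: i, startOffset: lo - n.start, endOffset: hi - n.start });
            }
        });
        results.push({ from: from, to: to, pieces: pieces });
    }
    return results;
}
```

Each entry in `pieces` says which text node is involved and which slice of it belongs to the match, so a node would only need to be split where `startOffset` is greater than 0 or `endOffset` is less than the node's length.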
But the problem arises when I have a document like this:
<p> This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a> </p>
There is a sentence that starts outside the <a> tag but ends inside it. I do not want the script to split this link into two tags. In a more complex document, that could break the page. The code could either wrap the two sentences together:
<p> <span>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></span> </p>
Or just wrap each piece in its own element:
<p> <span>This program is </span> <a href="beta.html"> <span>not stable yet.</span> <span>Do not use this in production yet.</span> </a> </p>
There could be a parameter to specify which of the two it should do. I'm just not sure how to figure out when an impossible cut is about to happen, and how to recover from it.
Another problem arises when there is whitespace inside a child element, like this:
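One possible way to detect such a cut, as a sketch only: an element is a problem when part of its text lies inside the match, part lies outside, and it does not contain the whole match. The function below checks exactly that. It assumes the same `{ node, start }` records as above and relies only on `parentNode` and `nodeValue`; `findSplitElements` is a hypothetical name.

```javascript
// Hypothetical helper: returns the elements that a single wrapper around the
// match [from, to) would have to cut in two. An empty result means the match
// can be wrapped in one element.
function findSplitElements(smartNodes, from, to) {
    var total = new Map();   // element -> length of all text inside it
    var covered = new Map(); // element -> length of its text inside the match

    smartNodes.forEach(function (t) {
        var length = t.node.nodeValue.length;
        var lo = Math.max(from, t.start);
        var hi = Math.min(to, t.start + length);
        var overlap = Math.max(0, hi - lo);
        // Credit this text node to every ancestor.
        for (var el = t.node.parentNode; el; el = el.parentNode) {
            total.set(el, (total.get(el) || 0) + length);
            covered.set(el, (covered.get(el) || 0) + overlap);
        }
    });

    var conflicts = [];
    total.forEach(function (length, el) {
        var c = covered.get(el);
        // Partly inside the match, partly outside, and not an ancestor of
        // the whole match: wrapping would split this element.
        if (c > 0 && c < length && c < to - from) conflicts.push(el);
    });
    return conflicts;
}
```

For the beta.html example, the first sentence would report the <a> element, while the second sentence, or both sentences taken together, would report nothing. Recovering could then mean either widening the match until the list is empty or wrapping each piece separately. As far as I know, `Range.surroundContents()` throws in the same situation (a range that partially selects a non-text node), so it could serve as a cross-check in the browser.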
<p>This is a <b>sentence. </b></p>
Technically, the regex match ends right after the period, before the end of the <b> element. However, it would be much better to treat the space as part of the match and wrap it like this:
<p><span>This is a <b>sentence. </b></span></p>
Rather than this:
<p><span>This is a </span><b><span>sentence.</span> </b></p>
But this is a secondary problem. In the end, I could just include the extra whitespace in the regex.
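For instance, the script could extend whatever regex it receives so that it also swallows trailing whitespace. This is a minimal sketch of that idea; `withTrailingWhitespace` is a made-up name, and it assumes plain trailing whitespace is the only thing that needs absorbing:

```javascript
// Hypothetical helper: returns a copy of `regex` that also consumes any
// whitespace directly after a match. The non-capturing group keeps
// alternations intact and leaves capture group numbering unchanged.
function withTrailingWhitespace(regex) {
    return new RegExp("(?:" + regex.source + ")\\s*", regex.flags);
}

var sentence = withTrailingWhitespace(/[A-Z][^.]*\./);
var match = sentence.exec("This is a sentence. ");
// match[0] now includes the space after the period.
```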
I know this may sound like a "do it for me" question, and it is not the kind of quick question we see on SO on a daily basis, but I have been stuck on it for a while, and it is for an open-source library that I am working on. Solving this problem is the last hurdle. If you think another SE site is better suited for this question, please redirect me.