Best algorithm for highlighting a list of given words in an HTML file

I have HTML files that I can’t control. Thus, I cannot change their structure or layout.

For each of these HTML files, a list of words is produced by a separate algorithm. These words should be highlighted in the text of the HTML. For example, if the HTML markup is:

<p> Monkeys are going to die soon, if we don't stop killing them. So, we have to try hard to persuade hunters not to hunt monkeys. Monkeys are very intelligent, and they should survive. In fact, they deserve to survive. </p> 

and a list of words:

 are, we, monkey 

The result should look something like this:

 <p> <span class='highlight'>Monkeys</span> <span class='highlight'>are</span> going to die soon, if <span class='highlight'>we</span> don't stop killing them. So, <span class='highlight'>we</span> have to try hard to persuade hunters not to hunt <span class='highlight'>monkeys</span>. <span class='highlight'>Monkeys</span> <span class='highlight'>are</span> very intelligent, and they should survive. In fact, they deserve to survive. </p> 

The highlighting algorithm should:

  • be case-insensitive
  • be written in JavaScript (it runs inside the browser; jQuery is welcome)
  • be fast (it must cope with book-length text of almost 800 pages)
  • not trigger the browser's well-known "stop script" dialog
  • work on dirty HTML files (i.e. tolerate invalid markup such as unclosed elements; some of these files are HTML exports from MS Word, so you can imagine how dirty they are)
  • preserve the original HTML markup (no markup deleted, no markup changed other than wrapping the highlighted words in an element, no change in nesting; the HTML should look the same before and after, except for the highlighted words)
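One common way to approach the "fast" and "no stop script dialog" requirements is to split the work into small batches and yield to the browser between them. The sketch below is only an illustration: `processInChunks` is a hypothetical helper name, and the `schedule` parameter (defaulting to `setTimeout` with a zero delay) is an assumption made so the yielding strategy can be swapped.

```javascript
// Run worker(item, index) over all items, at most chunkSize items per batch.
// Between batches, control is handed back via schedule() so the page stays
// responsive. done() is called once after the last item.
function processInChunks(items, chunkSize, worker, done, schedule) {
  // Assumption: a zero-delay timeout is enough to let the browser breathe.
  schedule = schedule || function (fn) { setTimeout(fn, 0); };
  var size = Math.max(1, Math.floor(chunkSize) || 1); // guard against 0 or NaN
  var index = 0;

  function runChunk() {
    var end = Math.min(index + size, items.length);
    for (; index < end; index += 1) {
      worker(items[index], index);
    }
    if (index < items.length) {
      schedule(runChunk); // more work left: yield, then continue
    } else if (typeof done === 'function') {
      done();
    }
  }

  runChunk();
}

// Possible usage (textNodes and wrapMatches are placeholders, not real APIs):
// processInChunks(textNodes, 50, wrapMatches, function () { /* finished */ });
```

The chunk size of 50 in the usage comment is arbitrary; a suitable value would have to be found by measuring on the real documents.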

What I have done so far:

  • I get the list of words in JavaScript as an array, like ["are", "we", "monkey"]
  • I try to select the text nodes in the browser (this part is currently buggy)
  • I loop over the text nodes, and for each one I iterate over every word in the list, try to find it, and wrap it in an element
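The steps in this list could be sketched roughly as follows. This is an illustration under assumptions, not the asker's actual script: `splitByRegex` and `highlightTextNodes` are hypothetical names, the regex is assumed to carry the global flag, and the DOM part relies on the standard `document.createTreeWalker` API, so it only runs in a browser. Working on text nodes rather than on `innerHTML` leaves the surrounding markup and nesting untouched.

```javascript
// Split a string into pieces and mark which pieces matched the regex.
// Pure function, no DOM needed. The regex must have the "g" flag.
function splitByRegex(text, regex) {
  if (!regex.global) throw new Error('regex needs the global flag');
  var pieces = [], last = 0, m;
  regex.lastIndex = 0;
  while ((m = regex.exec(text)) !== null) {
    if (m[0] === '') { regex.lastIndex += 1; continue; } // skip empty matches
    if (m.index > last) {
      pieces.push({ text: text.slice(last, m.index), match: false });
    }
    pieces.push({ text: m[0], match: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    pieces.push({ text: text.slice(last), match: false });
  }
  return pieces;
}

// Browser-only: replace each text node under root with text and span pieces.
function highlightTextNodes(root, regex) {
  var doc = root.ownerDocument;
  var walker = doc.createTreeWalker(root, NodeFilter.SHOW_TEXT, null);
  var nodes = [], node;
  // Collect first: replacing nodes while walking would confuse the walker.
  while ((node = walker.nextNode())) {
    // Leave script and style contents alone.
    if (!/^(script|style)$/i.test(node.parentNode.nodeName)) nodes.push(node);
  }
  nodes.forEach(function (textNode) {
    var pieces = splitByRegex(textNode.nodeValue, regex);
    if (!pieces.some(function (p) { return p.match; })) return; // no match
    var frag = doc.createDocumentFragment();
    pieces.forEach(function (p) {
      if (p.match) {
        var span = doc.createElement('span');
        span.className = 'highlight';
        span.appendChild(doc.createTextNode(p.text));
        frag.appendChild(span);
      } else {
        frag.appendChild(doc.createTextNode(p.text));
      }
    });
    textNode.parentNode.replaceChild(frag, textNode);
  });
}
```

In a browser this might be called as `highlightTextNodes(document.body, /\b(?:are|we|monkeys?)\b/gi)`. Because one regex covers all words, each text node is scanned once instead of once per word.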

You can see it online here (username: demo@phis.ir, password: demo). The current script can be seen at the end of the page source.

+4
5 answers

Join your words with | into a single string, compile that string as a regular expression, and then replace each occurrence with the full match wrapped in the highlight tags.
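A minimal sketch of this idea, assuming the word list is a plain array of strings and the input is a piece of plain text rather than a full HTML string. The helper names `escapeRegExp`, `buildWordRegex` and `highlightText` are made up for illustration, and the optional trailing `s` is an assumption taken from the question's example, where `monkey` should also match `Monkeys`.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one case-insensitive, global regex that matches any listed word.
// Assumption: an optional "s" is allowed so "monkey" also matches "Monkeys".
function buildWordRegex(words) {
  var alternatives = words.map(escapeRegExp).join('|');
  return new RegExp('\\b(?:' + alternatives + ')s?\\b', 'gi');
}

// Wrap every match in a highlight span; "$&" inserts the matched text as-is,
// so the original casing is preserved.
function highlightText(text, words) {
  if (words.length === 0) return text; // an empty alternation would match everywhere
  return text.replace(buildWordRegex(words), "<span class='highlight'>$&</span>");
}

console.log(highlightText('Monkeys are smart, and we know it.', ['are', 'we', 'monkey']));
```

Running this on a whole HTML string would also hit words inside tag names and attribute values, so it is safer to apply it to the content of individual text nodes.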

+3

For your example, the following regular expression works. Maybe you can take it from there:

    "Monkeys are going to die soon, if we don't stop killing them. So, we have to try hard to persuade hunters not to hunt monkeys. Monkeys are very intelligent, and they should survive. In fact, they deserve to survive.".replace(/\b(we|are|monkeys?)\b/gi, "<span class='highlight'>$1</span>")
+2

I found this problem very interesting. Here is what I came up with:

  • use some plugin (or write one yourself) that notifies us when an element scrolls into view
  • walk the text nodes of these elements and wrap each word in a span, using a unique CSS class name derived from the word itself
  • add a way to insert CSS rules for these unique class names

sample: http://jsbin.com/welcome/44285/


The code is very hacky and only tested in the latest Chrome, but it worked for me and can certainly be built upon.

    /**
     * Highlighter factory
     *
     * @return Object
     */
    function highlighter() {
        var me = {},
            cssClassNames = {},
            cssClassNamesCount = 0,
            lastAddedRuleIndex,
            cbCount = 0,
            sheet;

        // add a stylesheet if none present
        if (document.styleSheets.length === 0) {
            sheet = document.createElement('style');
            $('head').append(sheet);
        }

        // get reference to the last stylesheet
        sheet = document.styleSheets.item(document.styleSheets.length - 1);

        /**
         * Returns a constant but unique css class name for the given word
         *
         * @param String word
         * @return String
         */
        function getClassNameForWord(word) {
            word = word.toLowerCase();
            return cssClassNames[word] = cssClassNames[word] || 'highlight-' + (cssClassNamesCount += 1);
        }

        /**
         * Highlights the given list of words by adding a new css rule to the
         * list of active css rules
         *
         * @param Array words
         * @param String cssText
         * @return void
         */
        function highlight(words, cssText) {
            var i = 0,
                lim = words.length,
                classNames = [];

            // get the needed class names
            for (; i < lim; i += 1) {
                classNames.push('.' + getClassNameForWord(words[i]));
            }

            // remove the previously added rule
            if (lastAddedRuleIndex !== undefined) {
                sheet.deleteRule(lastAddedRuleIndex);
            }

            lastAddedRuleIndex = sheet.insertRule(classNames.join(', ') + ' { ' + cssText + ' }', sheet.cssRules.length);
        }

        /**
         * Calls the given function for each text node under the given parent element
         *
         * @param DomElement parentElement
         * @param Function onLoad
         * @param Function cb
         * @return void
         */
        function forEachTextNode(parentElement, onLoad, cb) {
            var i = parentElement.childNodes.length - 1,
                childNode;

            for (; i > -1; i -= 1) {
                childNode = parentElement.childNodes[i];

                if (childNode.nodeType === 3) {
                    cbCount += 1;

                    // defer the work so the browser stays responsive
                    setTimeout(function (node) {
                        return function () {
                            cb(node);
                            cbCount -= 1;
                            // typeof yields the lowercase string 'function'
                            if (cbCount === 0 && typeof onLoad === 'function') {
                                onLoad(me);
                            }
                        };
                    }(childNode), 0);
                } else if (childNode.nodeType === 1) {
                    // pass onLoad along, otherwise cb ends up in the wrong slot
                    forEachTextNode(childNode, onLoad, cb);
                }
            }
        }

        /**
         * Replaces each text node by span elements wrapping each word
         *
         * @param DomElement contextNode the parent element
         * @param Function onLoad called once all text nodes are processed
         * @return void
         */
        function add(contextNode, onLoad) {
            forEachTextNode(contextNode, onLoad, function (textNode) {
                var doc = textNode.ownerDocument,
                    frag = doc.createDocumentFragment(),
                    words = textNode.nodeValue.split(/(\W)/g),
                    i = 0,
                    lim = words.length,
                    span;

                for (; i < lim; i += 1) {
                    if (/^\s*$/m.test(words[i])) {
                        frag.appendChild(doc.createTextNode(words[i]));
                    } else {
                        span = doc.createElement('span');
                        span.setAttribute('class', getClassNameForWord(words[i]));
                        span.appendChild(doc.createTextNode(words[i]));
                        frag.appendChild(span);
                    }
                }

                textNode.parentNode.replaceChild(frag, textNode);
            });
        }

        // set public api and return created object
        me.highlight = highlight;
        me.add = add;

        return me;
    }

    var h = highlighter();
    h.highlight(['Lorem', 'magna', 'gubergren'], 'background: yellow;');

    // on ready
    $(function ($) {
        // using the in-view plugin (see the full code in the link above) here,
        // to only parse elements that are actually visible
        $('#content > *').one('inview', function (evt, visible) {
            if (visible) {
                h.add(this);
            }
        });

        $(window).scroll();
    });
+2

You can try a lib called Linguigi, which I hacked together:

    var ling = new Linguigi();
    ling.eachToken(/are|we|monkey/g, true, function (text) {
        return '<span class="highlight">' + text + '</span>';
    });
+1

If you are using jQuery, try this.

    $('* :not(:has(*))').html(function (i, v) {
        return v.replace(/searchString/g, '<span class="highlight">searchString</span>');
    });

$('* :not(:has(*))') selects every element that has no child elements, and the callback replaces the search string in that element's HTML with the same string wrapped in your highlight markup.

My quick and dirty solution is based on the solution in this blog:

http://wowmotty.blogspot.in/2011/05/jquery-findreplace-text-without.html

That solution works with a div selector and replaces only text; mine tries to replace the innerHTML string.

Try it and tell me what else can be done. It seems interesting.

0

Source: https://habr.com/ru/post/1444101/

