Keyword highlighting is highlighting the highlights in PHP preg_replace()

Question


I have a small search engine that does its job, and I want to highlight the results. I thought I had it all worked out until a set of keywords I used today blew it out of the water.

The issue is that preg_replace() loops through the replacements, and later replacements replace text that was inserted by earlier ones. Confused? Here is my pseudo-function:

    public function highlightKeywords($data, $keywords = array())
    {
        $find = array();
        $replace = array();
        $begin = "<span class=\"keywordHighlight\">";
        $end = "</span>";
        foreach ($keywords as $kw) {
            $find[] = '/' . str_replace("/", "\/", $kw) . '/iu';
            $replace[] = $begin . "\$0" . $end;
        }
        return preg_replace($find, $replace, $data);
    }

OK, so it works when searching for "fred" and "dagg", but sadly, when searching for "class", "lass" and "as", it runs into a real problem when highlighting "Joseph Class Group":

 Joseph <span class="keywordHighlight">Cl</span><span <span c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span>="keywordHighlight">c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span></span>="keywordHighlight">ass</span> Group

How can I get the later replacements to work only on the non-HTML parts, while still allowing the whole match to be tagged? For example, if I were searching for "cla" and "lass", I would want "class" to be highlighted in full, because both search terms are in it even though they overlap. The highlighting applied to the first match also has "class" in it (in the span's class attribute), but that should not be highlighted.

Sigh.

I would rather use a PHP solution than a jQuery (or any other client-side) one.

Note: I have tried sorting the keywords by length, longest first, but that means overlapping terms are not fully highlighted: with "cla" and "lass", only part of the word "class" is highlighted, and it still mangled the replacement tags :(

EDIT: I messed about, starting with pencil and paper and some wild ramblings, and came up with some very unglamorous code to solve this issue. It's not great, so suggestions for trimming it down or speeding it up would still be greatly appreciated :)

    public function highlightKeywords($data, $keywords = array())
    {
        $find = array();
        $replace = array();
        $begin = "<span class=\"keywordHighlight\">";
        $end = "</span>";
        $hits = array();
        foreach ($keywords as $kw) {
            $offset = 0;
            while (($pos = stripos($data, $kw, $offset)) !== false) {
                $hits[] = array($pos, $pos + strlen($kw));
                $offset = $pos + 1;
            }
        }
        if ($hits) {
            usort($hits, function($a, $b) {
                if ($a[0] == $b[0]) {
                    return 0;
                }
                return ($a[0] < $b[0]) ? -1 : 1;
            });
            $thisthat = array(0 => $begin, 1 => $end);
            for ($i = 0; $i < count($hits); $i++) {
                foreach ($thisthat as $key => $val) {
                    $pos = $hits[$i][$key];
                    $data = substr($data, 0, $pos) . $val . substr($data, $pos);
                    for ($j = 0; $j < count($hits); $j++) {
                        if ($hits[$j][0] >= $pos) {
                            $hits[$j][0] += strlen($val);
                        }
                        if ($hits[$j][1] >= $pos) {
                            $hits[$j][1] += strlen($val);
                        }
                    }
                }
            }
        }
        return $data;
    }


php regex keyword highlight preg-replace

Crazychris Jan 31 '12 at 21:50


3 answers

Steve · Answer 1 · 2012-02-01T02:34:25+0000

I used the following to solve this problem:

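    <?php
    $protected_matches = array();

    // Pass 1 callback: stash the match and return a NUL-delimited placeholder.
    // (Callback arguments are taken by value; preg_replace_callback() does not
    // pass them by reference.)
    function protect($matches)
    {
        global $protected_matches;
        return "\0" . array_push($protected_matches, $matches[0]) . "\0";
    }

    // Pass 2 callback: swap the placeholder for the highlighted match.
    function restore($matches)
    {
        global $protected_matches;
        return '<span class="keywordHighlight">' . $protected_matches[$matches[1] - 1] . '</span>';
    }

    $result = preg_replace_callback(
        '/\x00(\d+)\x00/',
        'restore',
        preg_replace_callback($patterns, 'protect', $target_string)
    );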

The first preg_replace_callback() pulls out every match and replaces it with a placeholder delimited by null bytes; the second pass replaces the placeholders with span tags.

Edit: I forgot to mention that $patterns was sorted by string length, longest to shortest.

Edit: another solution
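To make the two-pass idea concrete, here is a minimal self-contained sketch. It is a variation, not the exact code above: the function name highlightWithPlaceholders is hypothetical, it uses closures instead of a global, and it folds all keywords into one alternation pattern so that a keyword can never match the digits inside a placeholder. It also assumes the input text contains no NUL bytes of its own.

```php
<?php
// Hypothetical helper (not from the original answer): two-pass highlighting
// with NUL-delimited placeholders, so inserted markup is never re-matched.
function highlightWithPlaceholders($text, array $keywords)
{
    // Drop empty keywords; an empty alternative would match everywhere.
    $keywords = array_filter($keywords, 'strlen');
    if (!$keywords) {
        return $text;
    }

    // Longest keywords first, so "class" wins over "lass" and "as".
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // One alternation pattern; preg_quote() escapes regex metacharacters.
    $alternatives = array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $keywords);
    $pattern = '/' . implode('|', $alternatives) . '/iu';

    // Pass 1: stash each match and leave "\0<index>\0" in its place.
    $stash = array();
    $marked = preg_replace_callback($pattern, function ($m) use (&$stash) {
        $stash[] = $m[0];
        return "\0" . (count($stash) - 1) . "\0";
    }, $text);
    if ($marked === null) {
        return $text; // regex error, e.g. input is not valid UTF-8
    }

    // Pass 2: swap every placeholder for the wrapped original text.
    return preg_replace_callback('/\x00(\d+)\x00/', function ($m) use ($stash) {
        return '<span class="keywordHighlight">' . $stash[(int) $m[1]] . '</span>';
    }, $marked);
}
```

Calling highlightWithPlaceholders('Joseph Class Group', array('class', 'lass', 'as')) wraps "Class" in a single span. Overlapping terms such as "cla" and "lass" would still highlight only the first match, which is the case the range-based solution addresses.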

    <?php
    function highlightKeywords($data, $keywords = array(), $prefix = '<span class="hilite">', $suffix = '</span>')
    {
        $datacopy = strtolower($data);
        $keywords = array_map('strtolower', $keywords);
        $start = array();
        $end = array();
        foreach ($keywords as $keyword) {
            $offset = 0;
            $length = strlen($keyword);
            while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
                $start[] = $pos;
                $end[] = $offset = $pos + $length;
            }
        }
        if (!count($start)) return $data;
        sort($start);
        sort($end);

        // Merge and sort start/end using negative values to identify endpoints
        $zipper = array();
        $i = 0;
        $n = count($end);
        while ($i < $n)
            $zipper[] = count($start) && $start[0] <= $end[$i] ? array_shift($start) : -$end[$i++];

        // EXAMPLE:
        // [ 9, 10, -14, -14, 81, 82, 86, -86, -86, -90, 99, -103 ]
        // take 9, discard 10, take -14, take -14, create pair,
        // take 81, discard 82, discard 86, take -86, take -86, take -90, create pair
        // take 99, take -103, create pair
        // result: [9,14], [81,90], [99,103]

        // Generate non-overlapping start/end pairs
        $a = array_shift($zipper);
        $z = $x = null;
        while ($x = array_shift($zipper)) {
            if ($x < 0)
                $z = $x;
            else if ($z) {
                $spans[] = array($a, -$z);
                $a = $x;
                $z = null;
            }
        }
        $spans[] = array($a, -$z);

        // Insert the prefix/suffix in the start/end locations
        $n = count($spans);
        while ($n--)
            $data = substr($data, 0, $spans[$n][0])
                . $prefix
                . substr($data, $spans[$n][0], $spans[$n][1] - $spans[$n][0])
                . $suffix
                . substr($data, $spans[$n][1]);
        return $data;
    }

Steve · Answer 2 · 2012-04-30T00:02:30+0000

I had to revisit this problem today and wrote a better version of the above. I'll include it here. It's the same idea, only easier to read, and it should perform better because it builds an array of substrings and joins them once instead of repeatedly concatenating strings.

    <?php
    function highlight_range_sort($a, $b)
    {
        $A = abs($a);
        $B = abs($b);
        if ($A == $B)
            return $a < $b ? 1 : 0;
        else
            return $A < $B ? -1 : 1;
    }

    function highlightKeywords($data, $keywords = array(), $prefix = '<span class="highlight">', $suffix = '</span>')
    {
        $datacopy = strtolower($data);
        $keywords = array_map('strtolower', $keywords);

        // this will contain offset ranges to be highlighted
        // positive offset indicates start
        // negative offset indicates end
        $ranges = array();

        // find start/end offsets for each keyword
        foreach ($keywords as $keyword) {
            $offset = 0;
            $length = strlen($keyword);
            while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
                $ranges[] = $pos;
                $ranges[] = -($offset = $pos + $length);
            }
        }
        if (!count($ranges)) return $data;

        // sort offsets by abs(), positive
        usort($ranges, 'highlight_range_sort');

        // combine overlapping ranges by keeping lesser
        // positive and negative numbers
        $i = 0;
        while ($i < count($ranges) - 1) {
            if ($ranges[$i] < 0) {
                if ($ranges[$i + 1] < 0)
                    array_splice($ranges, $i, 1);
                else
                    $i++;
            } else if ($ranges[$i + 1] < 0)
                $i++;
            else
                array_splice($ranges, $i + 1, 1);
        }

        // create substrings
        $ranges[] = strlen($data);
        $substrings = array(substr($data, 0, $ranges[0]));
        for ($i = 0, $n = count($ranges) - 1; $i < $n; $i += 2) {
            // prefix + highlighted_text + suffix + regular_text
            $substrings[] = $prefix;
            $substrings[] = substr($data, $ranges[$i], -$ranges[$i + 1] - $ranges[$i]);
            $substrings[] = $suffix;
            $substrings[] = substr($data, -$ranges[$i + 1], $ranges[$i + 2] + $ranges[$i + 1]);
        }

        // join and return substrings
        return implode('', $substrings);
    }

    // Example usage:
    echo highlightKeywords("This is a test.\n", array("is"), '(', ')');
    echo highlightKeywords("Classes are as hard as they say.\n", array("as", "class"), '(', ')');

    // Output:
    // Th(is) (is) a test.
    // (Class)es are (as) hard (as) they say.
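For readers who find the sign-encoded offsets hard to follow, the same range-merging idea can be sketched with explicit [start, end) pairs. This is an illustrative rewrite under my own assumptions, not the code above: the name highlightMergedRanges is hypothetical, it uses stripos() rather than lowercasing a copy, and like the original it works on byte offsets, so case folding of multibyte characters is not handled.

```php
<?php
// Hypothetical sketch: collect [start, end) byte ranges for every keyword
// hit, merge overlapping ranges, then wrap each merged range exactly once.
function highlightMergedRanges($data, array $keywords, $prefix = '<span class="highlight">', $suffix = '</span>')
{
    $ranges = array();
    foreach ($keywords as $keyword) {
        $length = strlen($keyword);
        if ($length === 0) {
            continue; // skip empty keywords
        }
        $offset = 0;
        while (($pos = stripos($data, $keyword, $offset)) !== false) {
            $ranges[] = array($pos, $pos + $length);
            $offset = $pos + 1; // step by one so overlapping hits are found
        }
    }
    if (!$ranges) {
        return $data;
    }

    // Sort by start offset, then fold each range into the previous one
    // whenever the two overlap or touch.
    usort($ranges, function ($a, $b) {
        return $a[0] - $b[0];
    });
    $merged = array(array_shift($ranges));
    foreach ($ranges as $range) {
        $last = count($merged) - 1;
        if ($range[0] <= $merged[$last][1]) {
            $merged[$last][1] = max($merged[$last][1], $range[1]);
        } else {
            $merged[] = $range;
        }
    }

    // Build the output left to right from pieces, then join once.
    $pieces = array();
    $cursor = 0;
    foreach ($merged as $range) {
        $pieces[] = substr($data, $cursor, $range[0] - $cursor);
        $pieces[] = $prefix . substr($data, $range[0], $range[1] - $range[0]) . $suffix;
        $cursor = $range[1];
    }
    $pieces[] = substr($data, $cursor);

    return implode('', $pieces);
}
```

With '(' and ')' as prefix and suffix, searching 'Joseph Class Group' for "cla" and "lass" gives "Joseph (Class) Group": the two overlapping hits collapse into one range before any markup is inserted, so the markup itself is never searched.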

sanbikinoraion · Answer 3 · 2012-05-01T16:27:10+0000

OP, what's not clear from the question is whether $data can contain HTML from the get-go. Can you clarify this?

If $data can contain HTML itself, you are getting into the realm of trying to parse a non-regular language with a regular-language parser, and that's not going to work out well.

In that case, I would suggest loading the $data HTML into a PHP DOMDocument, getting hold of all the text nodes, and running one of the other perfectly good answers on the contents of each text node in turn.
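A rough sketch of what that could look like, assuming the input is an HTML fragment encoded as UTF-8. The function name highlightInHtml, the wrapper id hl-root, and the simple regex-based highlighter are all my own placeholders; the XML encoding prefix is a commonly used workaround to make loadHTML() read the input as UTF-8, not an official API.

```php
<?php
// Hypothetical sketch: highlight keywords only inside text nodes, so tags
// and attributes (such as class="...") are never touched.
function highlightInHtml($html, array $keywords)
{
    $keywords = array_filter($keywords, 'strlen');
    if (!$keywords) {
        return $html;
    }
    // Longest first, so the longest keyword wins at any given position.
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $quoted = array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $keywords);
    $pattern = '/(' . implode('|', $quoted) . ')/iu';

    // Parse the fragment inside a wrapper element we can find again later.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//*[@id="hl-root"]')->item(0);
    if ($root === null) {
        return $html;
    }

    // Snapshot the text nodes first, because the loop below edits the tree.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));
    foreach ($textNodes as $node) {
        // With one capture group, odd indexes in $parts are the matches.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no keyword in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'keywordHighlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the wrapper's children, dropping the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because matching runs one text node at a time, a keyword that is split across elements will not be found, and the serialized markup may differ slightly from the input since the parser normalizes it.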
