Matching substring without accent

Question


I have a search function that retrieves data from an InnoDB table (utf8_spanish_ci collation) and displays it in an HTML document (UTF-8 charset). The user enters a substring and receives a list of matches in which the first occurrence of the substring is highlighted, for example:

 Matches for "AL":
 Álava
 <strong>Al</strong>bacete
 <strong>Al</strong>mería
 Ciudad Re<strong>al</strong>
 Málaga

As the example shows, the search ignores differences in both case and accents (MySQL takes care of this automatically). However, the code I use to highlight matches cannot handle the latter, so accented matches such as Álava and Málaga are not highlighted:

 <?php
 private static function highlightTerm($full_string, $match){
     $start = mb_stripos($full_string, $match);
     $length = mb_strlen($match);
     return
         htmlspecialchars( mb_substr($full_string, 0, $start)) .
         '<strong>' .
         htmlspecialchars( mb_substr($full_string, $start, $length) ) .
         '</strong>' .
         htmlspecialchars( mb_substr($full_string, $start+$length) );
 }
 ?>

Is there any reasonable way to fix this that doesn't involve hard-coding all the possible variations?

Update: System specifications: PHP 5.2.14 and MySQL 5.1.48

+2

php utf-8 collation

Álvaro González Aug 27 '10 at 9:36


2 answers

Use PEAR's I18N_UnicodeNormalizer 1.0.0:

 include('…');
 echo preg_replace(
     '/(\P{L})/ui', // replace everything except members of the Unicode class "letters", case-insensitively
     '',            // with nothing → drops the accents
     I18N_UnicodeNormalizer::toNFKD('ÅÉÏÔÙåéïôù') // ù → u + `
 );

→ AEIOUaeiou

+1

eleg Oct 18 '10 at 10:38


Gumbo · Accepted Answer · 2010-08-27T09:58:10+0000

You can use Normalizer to normalize the string to Normalization Form KD (NFKD), in which characters are decomposed. For example, Á (U+00C1) is decomposed into the letter A (U+0041) followed by the combining acute accent ́ (U+0301):

 $str = Normalizer::normalize($str, Normalizer::FORM_KD);
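To see what the decomposition does, here is a minimal sketch. It assumes the intl extension (which provides Normalizer) is loaded; the sample word is only an illustration, written with byte escapes so the result does not depend on the file's encoding.

```php
<?php
// Assumes the intl extension, which provides the Normalizer class, is available.
$composed   = "\xC3\x81lava"; // "Álava" with a precomposed Á (U+00C1) in UTF-8
$decomposed = Normalizer::normalize($composed, Normalizer::FORM_KD);

// Á becomes A (U+0041) followed by the combining acute accent (U+0301),
// so the decomposed string is one code point longer.
echo mb_strlen($composed, 'UTF-8'), "\n";   // 5
echo mb_strlen($decomposed, 'UTF-8'), "\n"; // 6

// Removing the combining marks leaves only the unaccented base letters.
$stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
echo $stripped, "\n"; // Alava
```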

Then you modify the search pattern so that each letter may be followed by an optional combining mark:

 $pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';
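For a concrete picture, this sketch prints the pattern built for a sample term and checks it against a decomposed string. The term and the test string are illustrative assumptions, not part of the original code.

```php
<?php
$term = 'al'; // sample search term, for illustration only

// Same construction as above: every letter may be followed by one combining mark.
$pattern = '/(' . preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')) . ')/ui';
echo $pattern, "\n"; // /(a\p{Mn}?l\p{Mn}?)/ui

// "Málaga" in decomposed form: "a" followed by U+0301 (UTF-8 bytes CC 81).
$decomposed = "Ma\xCC\x81laga";
$found = preg_match($pattern, $decomposed, $m);
var_dump($found === 1); // bool(true)
```

The matched text in `$m[0]` includes the combining mark, which is why the highlight ends up wrapping the accented letter as a whole.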

The replacement is done using preg_replace:

 preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str))

So the complete method:

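 private static function highlightTerm($str, $term) {
     // Decompose accented characters into base letter + combining mark
     $str = Normalizer::normalize($str, Normalizer::FORM_KD);
     // Allow an optional combining mark after every letter of the term
     $pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';
     // Wrap every match in <strong> tags
     return preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str));
 }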
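The question asks for only the first occurrence to be highlighted, and a user may also type the term with accents. The following is one possible variation on the method above, not part of the original answer: the standalone function name `highlightFirstMatch` is hypothetical, it assumes the intl extension is loaded, and it allows any number of combining marks per letter (`*` instead of `?`).

```php
<?php
// Hypothetical standalone variant; requires the intl extension for Normalizer.
function highlightFirstMatch($str, $term)
{
    // Decompose both strings so accents become separate combining marks.
    $str  = Normalizer::normalize($str, Normalizer::FORM_KD);
    $term = Normalizer::normalize($term, Normalizer::FORM_KD);

    // Drop the marks from the term so "ál" and "al" build the same pattern.
    $term = preg_replace('/\p{Mn}/u', '', $term);
    if ($term === '') {
        return htmlspecialchars($str, ENT_QUOTES, 'UTF-8');
    }

    // Allow zero or more combining marks after every letter of the term.
    $pattern = '/' . preg_replace('/\p{L}/u', '$0\p{Mn}*', preg_quote($term, '/')) . '/ui';

    // preg_match stops at the first match; offsets are reported in bytes.
    if (!preg_match($pattern, $str, $m, PREG_OFFSET_CAPTURE)) {
        return htmlspecialchars($str, ENT_QUOTES, 'UTF-8');
    }
    list($match, $offset) = $m[0];
    $before = substr($str, 0, $offset);
    $after  = substr($str, $offset + strlen($match));

    // Escape each piece separately so the search never runs on HTML entities.
    $html = htmlspecialchars($before, ENT_QUOTES, 'UTF-8')
          . '<strong>' . htmlspecialchars($match, ENT_QUOTES, 'UTF-8') . '</strong>'
          . htmlspecialchars($after, ENT_QUOTES, 'UTF-8');

    // Recompose (NFC) so the output uses precomposed characters again.
    return Normalizer::normalize($html, Normalizer::FORM_C);
}

echo highlightFirstMatch("M\xC3\xA1laga", 'AL'), "\n"; // "Málaga" with "ál" wrapped in <strong>
echo highlightFirstMatch('Ciudad Real', 'AL'), "\n";   // Ciudad Re<strong>al</strong>
```

Because NFKD also folds compatibility characters (ligatures, for instance), the returned string is not guaranteed to be identical to the input apart from the added tags; using FORM_D instead would avoid that at the cost of not matching such characters.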
