Highlighting a substring match while ignoring accents

I have a search function that retrieves data from an InnoDB table (utf8_spanish_ci collation) and displays it in an HTML document (UTF-8 charset). The user types a substring and gets a list of matches in which the first occurrence of the substring is highlighted, for example:

    Matches for "AL":
    Álava
    <strong>Al</strong>bacete
    <strong>Al</strong>mería
    Ciudad Re<strong>al</strong>
    Málaga

As the example shows, the search itself ignores differences in both case and accent (MySQL takes care of this automatically). However, the code I use to highlight the matches cannot do the latter:

    <?php
    private static function highlightTerm($full_string, $match){
        $start = mb_stripos($full_string, $match);
        $length = mb_strlen($match);

        return
            htmlspecialchars(mb_substr($full_string, 0, $start)) .
            '<strong>' .
            htmlspecialchars(mb_substr($full_string, $start, $length)) .
            '</strong>' .
            htmlspecialchars(mb_substr($full_string, $start+$length));
    }
    ?>
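To make the failure concrete, here is a minimal sketch of what mb_stripos does with two of the sample strings. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Requires the mbstring extension; the sample strings come from the example above.
mb_internal_encoding('UTF-8');

// Case is folded, so "AL" is found at the start of "Albacete".
$plain = mb_stripos('Albacete', 'AL');

// "Á" (U+00C1) is a different code point from "A", so nothing is found.
$accented = mb_stripos('Álava', 'AL');

var_dump($plain);    // int(0)
var_dump($accented); // bool(false)
```

So the lookup is case-insensitive but not accent-insensitive, unlike the MySQL collation that produced the result list.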

Is there a reasonable way to fix this that doesn't involve hard-coding every possible variant?

Update: system specs are PHP 5.2.14 and MySQL 5.1.48.

2 answers

You can use Normalizer (part of the intl extension) to normalize the string to Normalization Form KD (NFKD), in which characters are decomposed. For example, Á (U+00C1) is decomposed into the letter A (U+0041) followed by the combining acute accent (U+0301):

 $str = Normalizer::normalize($str, Normalizer::FORM_KD); 

Then build the search pattern so that every letter of the term may be followed by an optional combining mark:

 $pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui'; 
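As a rough illustration of what that line produces, the sketch below builds the pattern for a made-up term and runs it against "Álava" written by hand in decomposed form, so it does not need Normalizer at all. The values of $term and $decomposed are assumptions chosen for the example.

```php
<?php
// Hypothetical search term, for illustration only.
$term = 'al';

// Append an optional combining mark (\p{Mn}) after every letter of the term.
$pattern = '/(' . preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')) . ')/ui';
// $pattern is now: /(a\p{Mn}?l\p{Mn}?)/ui

// "Álava" in decomposed form: "A", then U+0301 (UTF-8 bytes CC 81), then "lava".
$decomposed = "A\xCC\x81lava";

preg_match($pattern, $decomposed, $m);
// $m[0] is "A" + U+0301 + "l": the accent is included in the match.
```

Because the combining mark sits inside the matched text, it ends up inside the highlight instead of being split off from its base letter.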

The highlighting itself is done with preg_replace:

 preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str)) 

So the complete method:

    private static function highlightTerm($str, $term) {
        $str = Normalizer::normalize($str, Normalizer::FORM_KD);
        $pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';
        return preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str));
    }
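
A possible variation, offered only as a sketch: the question asks for the first occurrence only, the user may type accents in the search term, and the method above returns the text in decomposed form. The standalone function below addresses those three points. It assumes the intl extension is loaded; highlightFirstMatch() is a hypothetical name, not part of the method above.

```php
<?php
// Sketch only: assumes the intl extension (Normalizer class) is loaded.
// highlightFirstMatch() is a hypothetical name used for this example.
function highlightFirstMatch($str, $term)
{
    // Decompose both inputs so accents become separate combining marks.
    $str  = Normalizer::normalize($str, Normalizer::FORM_KD);
    $term = Normalizer::normalize($term, Normalizer::FORM_KD);

    // Drop marks typed by the user, so "ál" and "al" build the same pattern.
    $term = preg_replace('/\p{Mn}/u', '', $term);

    $pattern = '/(' . preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')) . ')/ui';

    // The fourth argument (limit = 1) highlights only the first occurrence.
    $html = preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str), 1);

    // Recompose, so "a" + U+0301 is emitted as the single character "á" again.
    return Normalizer::normalize($html, Normalizer::FORM_C);
}

echo highlightFirstMatch('Málaga', 'AL'), "\n"; // M<strong>ál</strong>aga
```

Note that decomposing with NFKD and recomposing with NFC also folds compatibility characters (ligatures, for instance) into their plain equivalents, which may or may not be acceptable for the displayed text.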

Use the PEAR package I18N_UnicodeNormalizer 1.0.0:

    include('…');
    echo preg_replace(
        '/(\P{L})/ui', // replace all except members of Unicode class "letters", case insensitive
        '',            // with nothing → drop accents
        I18N_UnicodeNormalizer::toNFKD('ÅÉÏÔÙåéïôù') // ù → u + `
    );

→ AEIOUaeiou
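
One caveat: \P{L} removes everything that is not a letter, including spaces, digits and punctuation. A narrower sketch of the same idea is shown below. It swaps the PEAR package for the intl extension's Normalizer (an assumption about what is installed) and deletes only combining marks; stripAccents() is a hypothetical helper name.

```php
<?php
// Sketch only: assumes the intl extension (Normalizer class) is loaded.
// stripAccents() is a hypothetical helper name.
function stripAccents($str)
{
    // Decompose, then delete only the combining marks (\p{Mn}),
    // leaving spaces, digits and punctuation untouched.
    return preg_replace('/\p{Mn}/u', '', Normalizer::normalize($str, Normalizer::FORM_KD));
}

echo stripAccents('Ciudad Real, Álava'), "\n"; // Ciudad Real, Alava
```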


Source: https://habr.com/ru/post/1343975/

