MySQL - matching Latin (English) form input against utf8 (non-English) data

I maintain a music database in MySQL. How do I return results stored as, for example, “Tiësto” when people search for “Tiesto”?

All the data is stored with full-text indexing, if that matters.

I already use a combination of Levenshtein in PHP and REGEXP in SQL, not to solve this particular problem, but to make search more forgiving in general.

PHP:

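    // Named levenshteinVariants() to avoid clashing with PHP's built-in
    // levenshtein(); PHP function names are case-insensitive.
    function levenshteinVariants($word) {
        $words = array();
        for ($i = 0; $i < strlen($word); $i++) {
            // insertion: wildcard placed before position $i
            $words[] = substr($word, 0, $i) . '_' . substr($word, $i);
            // deletion: character at position $i dropped
            $words[] = substr($word, 0, $i) . substr($word, $i + 1);
            // substitution: character at position $i replaced by a wildcard
            $words[] = substr($word, 0, $i) . '_' . substr($word, $i + 1);
        }
        // insertion at the end of the word
        $words[] = $word . '_';
        return $words;
    }

    $fuzzyartist = levenshteinVariants($_POST['searchartist']);
    $searchimplode = "'" . implode("', '", $fuzzyartist) . "'";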

MySQL:

    SELECT *
    FROM new_track_database
    WHERE artist REGEXP concat_ws('|', $searchimplode);

To add, I often perform character set conversions and lowercase sanitation in PHP, but these have always gone the OTHER way: standardizing non-Latin characters. I can't work out how to perform the opposite process, and only in certain circumstances, depending on the data I have stored.

php regex mysql search levenshtein distance
03 Oct '14 at 19:03
1 answer

A possible solution would be to create another column in the table next to artist, for example artist_normalized. When populating the table, you insert a "normalized" version of the string into it. The search can then be performed against the artist_normalized column.

Test code:

    <?php
    $transliterator = Transliterator::createFromRules(
        ':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;',
        Transliterator::FORWARD
    );

    $test = ['abcd', 'èe', '€', 'àòùìéëü', 'àòùìéëü', 'tiësto'];
    foreach ($test as $e) {
        $normalized = $transliterator->transliterate($e);
        echo $e . ' --> ' . $normalized . "\n";
    }
    ?>

Result:

    abcd --> abcd
    èe --> ee
    € --> €
    àòùìéëü --> aouieeu
    àòùìéëü --> aouieeu
    tiësto --> tiesto

The magic is performed by the Transliterator class. The specified rule performs three actions: it decomposes the string (NFD), removes the diacritical marks, and then recomposes the string into canonical form (NFC). The Transliterator in PHP is built on top of ICU, so when you do this you rely on the ICU library's tables, which are complete and reliable.
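
To make the three steps concrete, here is a rough sketch that performs them one at a time with the intl Normalizer class and a PCRE Unicode property instead of a single rule string. The helper name stripDiacritics is made up for illustration, and using \p{Mn} as the equivalent of [:Nonspacing Mark:] is an assumption on my part rather than something taken from the rule above.

```php
<?php
// Illustrative only: the decompose / remove / recompose steps done by hand.
// stripDiacritics() is a hypothetical helper name. Requires the intl extension.
function stripDiacritics($s)
{
    // 1. Decompose: "ë" becomes "e" followed by U+0308 (combining diaeresis).
    $decomposed = Normalizer::normalize($s, Normalizer::FORM_D);

    // 2. Remove every nonspacing mark (Unicode general category Mn).
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    // 3. Recompose whatever is left into canonical composed form.
    return Normalizer::normalize($stripped, Normalizer::FORM_C);
}

echo stripDiacritics('Tiësto'), "\n"; // prints "Tiesto"
```

Characters that do not decompose into a base letter plus a mark, such as "€", should pass through unchanged, which matches the result listed above.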

Note: this solution requires PHP 5.4 or higher with the intl extension.
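
As a sketch of how the artist_normalized column might be filled and searched, something along these lines could work. It assumes an open PDO connection, the new_track_database table from the question, and an artist_normalized column that has already been added; the function names normalizeArtist, insertTrackArtist and searchByArtist are hypothetical. The important part is that the stored name and the search term go through the same normalizer.

```php
<?php
// Hypothetical helper: one shared normalizer for stored names and search input.
function normalizeArtist($name)
{
    static $t = null;
    if ($t === null) {
        // Same rule string as in the test code.
        $t = Transliterator::createFromRules(
            ':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;',
            Transliterator::FORWARD
        );
    }
    return $t->transliterate($name);
}

// Assumes $pdo is an open PDO connection and that the table already has
// an artist_normalized column next to artist.
function insertTrackArtist(PDO $pdo, $artist)
{
    $stmt = $pdo->prepare(
        'INSERT INTO new_track_database (artist, artist_normalized) VALUES (?, ?)'
    );
    $stmt->execute([$artist, normalizeArtist($artist)]);
}

// Normalize the search term the same way before comparing.
function searchByArtist(PDO $pdo, $term)
{
    $stmt = $pdo->prepare(
        'SELECT * FROM new_track_database WHERE artist_normalized = ?'
    );
    $stmt->execute([normalizeArtist($term)]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

With this in place, a stored “Tiësto” and a search for “Tiesto” both reduce to “Tiesto”, so the row is found while the original spelling stays intact in the artist column for display.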

Oct 03 '14 at 20:37