PHP - smart, error-tolerant string comparison

Question

I am looking for a routine, or at least an approach, for comparing strings in a way that tolerates typing errors.

Let's say we have the test string Čakánka (yes, it contains Central European characters).

Now I want to accept any of the following lines as OK:

cakanka
cákanká
ČaKaNKA
CAKANKA
CAAKNKA
Ckaanka
cakakNa

The problem is that I often transpose letters in a word, and I want to minimize the user's frustration at not being able to type a word correctly (for example, when in a hurry).

So, I know how to do a case-insensitive comparison (just lowercase everything :]), and I can strip the CE characters. What I can't wrap my head around is tolerating a few transposed characters.

Also, a character often ends up not just one position off (character => cahracter) but sometimes several positions off (character => carahcter), simply because one finger is lazy while typing.

Thanks:]

+4

string comparison php

Adam kiss Feb 17 '10 at 23:20

3 answers

You can transliterate the words into Latin characters and use a phonetic algorithm such as Soundex to extract the essence of each word and compare it to the ones you have. In your case, the code is C252 for the words that keep the consonant order; a variant such as the last one, where the transposition reorders the consonants, gets a different code.
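As a minimal sketch of that idea, the snippet below computes a Soundex key after stripping accents. The helper names `toAscii` and `soundexKey` are hypothetical, and the hand-written accent map covers only the letters used in this thread; it stands in for a proper transliteration step such as `iconv` with `//TRANSLIT`, whose result depends on the platform and locale.

```php
<?php
// Hypothetical helper: maps only the accented letters from this thread to ASCII.
// A real application would need a fuller table or a transliteration library.
function toAscii(string $word): string
{
    return strtr($word, ['Č' => 'C', 'č' => 'c', 'Á' => 'A', 'á' => 'a']);
}

// Hypothetical helper: Soundex key of the accent-free word.
function soundexKey(string $word): string
{
    return soundex(toAscii($word));
}

foreach (['Čakánka', 'cakanka', 'cákanká', 'CAAKNKA', 'cakakNa'] as $word) {
    echo $word, ' => ', soundexKey($word), PHP_EOL;
}
```

Words that differ only in case, accents, or the position of a vowel share a key, while a variant that reorders consonants (here cakakNa) does not, so the key is best treated as a coarse first filter rather than a final verdict.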

Edit: The problem with comparison functions such as levenshtein or similar_text is that you need to call them for every pair of input value and candidate match. This means that if you have a database with 1 million records, you will need to call these functions 1 million times for each input.

But functions like soundex or metaphone, which compute a kind of digest, can help reduce the number of actual comparisons. If you store the soundex or metaphone value for every known word in your database, you can narrow down the set of possible matches very quickly. Once that set is small, you can use the comparison functions to pick the best match.

Here is an example:

    // Build the index that represents your database.
    $knownWords = array('Čakánka', 'Cakaka');
    $index = array();
    foreach ($knownWords as $key => $word) {
        $code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
        if (!isset($index[$code])) {
            $index[$code] = array();
        }
        $index[$code][] = $key;
    }

    // Test words.
    $testWords = array('cakanka', 'cákanká', 'ČaKaNKA', 'CAKANKA', 'CAAKNKA', 'CKAANKA', 'cakakNa');
    echo '<ul>';
    foreach ($testWords as $word) {
        $code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
        if (isset($index[$code])) {
            echo '<li> '.$word.' is similar to: ';
            // Rank every known word that shares the soundex code.
            $matches = array();
            foreach ($index[$code] as $key) {
                similar_text(strtolower($word), strtolower($knownWords[$key]), $percentage);
                $matches[$knownWords[$key]] = $percentage;
            }
            arsort($matches);
            echo '<ul>';
            foreach ($matches as $match => $percentage) {
                echo '<li>'.$match.' ('.$percentage.'%)</li>';
            }
            echo '</ul></li>';
        } else {
            echo '<li>no match found for '.$word.'</li>';
        }
    }
    echo '</ul>';

+3

Gumbo Feb 17 '10 at 23:27

Spell checkers do something like fuzzy string comparison. Perhaps you can adapt an algorithm based on this link, or take the spelling-suggestion code from an open source project such as Firefox.

+1

wallyk Feb 17 '10 at 23:26

Pascal martin · Accepted Answer · 2010-02-17T23:26:56+0000

Not sure (especially about the accents / special characters, which you might have to deal with first), but for characters that are in the wrong place or missing, the levenshtein function, which calculates the Levenshtein distance between two strings, might help you (quoting):

    int levenshtein ( string $str1 , string $str2 )
    int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )

The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2.
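A rough sketch of how this could be applied to the question: normalize both strings first, then accept the input if the distance stays under a threshold. The helpers `normalize` and `isCloseEnough`, the hand-written accent map, and the default threshold of 2 are all assumptions made for illustration. A swap of two adjacent letters costs two replacements, and a single letter moved any number of positions costs one delete plus one insert, so both kinds of slip from the question come out at distance 2.

```php
<?php
// Hypothetical helper: strips the accents seen in this thread and lowercases.
function normalize(string $word): string
{
    $ascii = strtr($word, ['Č' => 'C', 'č' => 'c', 'Á' => 'A', 'á' => 'a']);
    return strtolower($ascii);
}

// Hypothetical helper: accepts the input when the edit distance between the
// normalized strings is at most $maxDistance (an assumed threshold).
function isCloseEnough(string $input, string $expected, int $maxDistance = 2): bool
{
    return levenshtein(normalize($input), normalize($expected)) <= $maxDistance;
}

foreach (['cákanká', 'CAAKNKA', 'Ckaanka', 'cakakNa', 'banana'] as $input) {
    echo $input, ': ', isCloseEnough($input, 'Čakánka') ? 'accepted' : 'rejected', PHP_EOL;
}
```

A fixed threshold is the simplest choice; for short words it may be too generous, so scaling it with the word length could be worth trying.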

Other functions that might be useful are soundex, similar_text or metaphone.
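For orientation, here is a small sketch of how two of those functions are called. The wrapper `similarityPercent` is a hypothetical convenience, not part of PHP; it only hides the by-reference percentage argument of similar_text.

```php
<?php
// similar_text() returns the number of matching characters and writes the
// similarity percentage into its third, by-reference argument.
// Hypothetical wrapper that returns just the percentage.
function similarityPercent(string $a, string $b): float
{
    similar_text($a, $b, $percent);
    return $percent;
}

// metaphone() ignores letter case, so both spellings get the same key.
$sameKey = metaphone('cakanka') === metaphone('CAKANKA');

echo 'similarity: ', round(similarityPercent('cakanka', 'caaknka'), 1), '%', PHP_EOL;
echo 'same metaphone key: ', $sameKey ? 'yes' : 'no', PHP_EOL;
```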

And some of the user notes on the manual pages of those functions, especially the levenshtein manual page, might contain useful things too ;-)
