Is the PHP levenshtein () function an error?

On this levenshtein () page, I use example # 1 with the following variables:

// input misspelled word $input = 'htc corporation'; // array of words to check against $words = array('htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange', 'radish', 'carrot', 'pea', 'bean'); 

Can someone tell me why the expected result is carrots rather than htc ? Thanks

+4
source share
3 answers

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of one-character changes (insertion, deletion, substitution) required to change one word to another.

here is a simple analysis

 $input = 'htc corporation'; // array of words to check against $words = array( 'htc', 'Sprint Nextel', 'Sprint', 'banana', 'orange', 'radish', 'carrot', 'pea', 'bean' ); foreach ( $words as $word ) { // Check for Intercept $ic = array_intersect(str_split($input), str_split($word)); printf("%s \tl= %s , s = %s , c = %d \n",$word , levenshtein($input, $word), similar_text($input, $word), count($ic)); } 

Exit

 htc l= 12 , s = 3 , c = 5 Sprint Nextel l= 14 , s = 3 , c = 8 Sprint l= 12 , s = 1 , c = 7 banana l= 14 , s = 2 , c = 2 orange l= 12 , s = 4 , c = 7 radish l= 12 , s = 3 , c = 5 carrot l= 11 , s = 1 , c = 10 pea l= 13 , s = 2 , c = 2 bean l= 13 , s = 2 , c = 2 

Clear the htc distance 12 , while the carrot has 11 , if you want htc then Levenshtein was missing .. you need to compare the exact word and then set priorities

+2
source

Since the distance of levenshtein from htc corporation is 12, while the distance to carrots is only 11.

The levenshtein function calculates how many characters it has to add or replace to get a specific word, and since htc corporation has 12 extra characters than htc , it has to remove 12 to get only htc . To go to the word carrot from htc corporation , 11 changes are required.

+4
source

"htc corporation" on "htc" has a distance of 12 (delete "corporation" = 12 characters). "htc corporation" on "carrots" has a distance of not more than 11.

"htc corporation" => "corporation": 4
"corporation" => "corporat": 3
"corporat" => "corrat": 2
"corrat" => "carrat": 1
"carrat" => "carrots": 1

4 + 3 + 2 + 1 + 1 = 11

It looks like you can look for not the direct levenshtein distance, but the “closest substring” match. There's an example of implementing such a thing using a modified Levenshtein algorithm here . Using this algorithm, you will get:

htc: 0
Sprint Nextel: 11
Sprint: 4
banana: 5
orange: 3
radish: 3
carrots: 3
peas: 2
bean: 3

which recognizes "htc" as an exact subscript match and gives it zero. The runner-up, “peas,” has a batch of two, because you can align it with “p,” “e,” or “a,” in the corporation, and then replace the other two characters, etc. When working with this algorithm, you should know that the score will never be higher than the length of the “needle” line, so shorter lines will usually get lower scores (they are “easier to match”).

+3
source

Source: https://habr.com/ru/post/1494988/


All Articles