Good metric for fuzzy matching without word order

I'm looking for a good metric (cosine, chapman, jaccard, jaro, dice, etc.) for fuzzy string matching that ignores word order. I am open to combining several metrics.

For instance:

'john rambo' == 'jovn rambo'
'john rambo' == 'rambo jovn'
'john rambo' == 'john rambo x'
'john rambo the vietnam veteran' == 'john rambo the vietnam us veteran'

but

'john kerry' != 'john rambo'

I want to detect similar strings when there is a typo, an added letter, or an added word (in the last case, the compared strings should be reasonably long for an extra word in one of them to still count as similar).

1 answer

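One approach: split both strings into words, compare them word by word, and treat two words as a match when their similarity is > 75%.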

Java:

String str1 = "john rambo the vietnam veteran";
String str2 = "jovn rabbo the vittnam us vetteran";

Split them into words:

// needs java.util.ArrayList and java.util.Arrays
ArrayList<String> a = new ArrayList<String>(Arrays.asList(str1.split(" ")));
ArrayList<String> b = new ArrayList<String>(Arrays.asList(str2.split(" ")));

Then compare the two word lists:

boolean are_equal = true;
boolean word_found = false;

if (a.size() < b.size())
{
    for (String a_word : a)
    {
        word_found = false;

        for (int i=0; i<b.size(); i++)
        {
            String b_word = b.get(i);

            if (is_similar_enough (a_word, b_word))
            {
                word_found = true;
                b.remove(i);
                break;
            }
        }

        if (!word_found)
        {
            are_equal = false;
            break;
        }
    }
}
else
{
   // ..
}
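
The loop above only handles the case where the first list is shorter, and the else branch is left out. One way to cover both cases, as a rough sketch, is to always walk the shorter list and consume matches from the longer one. The class name `WordSetMatcher`, the method `matches`, and the `BiPredicate` parameter are hypothetical names introduced for this illustration; the word-level similarity test is passed in, so any check can be plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

public class WordSetMatcher {

    // Returns true when every word of the shorter string has its own
    // similar word in the longer string, regardless of word order.
    public static boolean matches(String s1, String s2,
                                  BiPredicate<String, String> similar) {
        List<String> a = new ArrayList<>(Arrays.asList(s1.trim().split("\\s+")));
        List<String> b = new ArrayList<>(Arrays.asList(s2.trim().split("\\s+")));

        // Assumption: iterate over the shorter list, search in the longer one.
        List<String> shorter = (a.size() <= b.size()) ? a : b;
        List<String> longer = (shorter == a) ? b : a;

        for (String word : shorter) {
            boolean found = false;
            for (int i = 0; i < longer.size(); i++) {
                if (similar.test(word, longer.get(i))) {
                    longer.remove(i); // each word may be matched only once
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Exact comparison stands in for a real similarity check here.
        BiPredicate<String, String> exact = String::equals;
        System.out.println(matches("john rambo", "rambo john", exact));
        System.out.println(matches("john rambo", "john rambo x", exact));
        System.out.println(matches("john kerry", "john rambo", exact));
    }
}
```

This sketch accepts any number of extra words in the longer string; the length condition mentioned in the question would need an additional check, for example a cap on the number of unmatched words.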

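Here a word counts as found when one of the remaining words in the other list is similar enough to it (at least 75% of the characters match).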

is_similar_enough:

public boolean is_similar_enough (String a, String b)
{
    int equivalentChars = 0;

    // Words of different lengths are never considered similar.
    if (a.length() != b.length())
        return false;

    // Count the positions at which both words have the same character.
    for (int i=0; i<a.length(); i++)
        if (a.charAt(i) == b.charAt(i))
            equivalentChars++;

    return ((((double)equivalentChars) / ((double)a.length())) >= 0.75);
}
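
The positional check above rejects any pair of words with different lengths, so "veteran" and "vetteran" from the example strings can never match. A common alternative, which is not part of the original answer, is to derive the ratio from the Levenshtein edit distance instead. The sketch below is one possible variant; the class name `EditSimilarity`, the method names, and the reuse of the 0.75 threshold are assumptions made for illustration.

```java
public class EditSimilarity {

    // Levenshtein distance via dynamic programming, keeping two rows.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap the rows for the next iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: similarity = 1 - distance / length of the longer word,
    // compared against the same 0.75 threshold as the positional check.
    public static boolean isSimilarEnough(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return true; // two empty words are treated as identical
        }
        double similarity = 1.0 - (double) levenshtein(a, b) / maxLen;
        return similarity >= 0.75;
    }

    public static void main(String[] args) {
        System.out.println(isSimilarEnough("john", "jovn"));
        System.out.println(isSimilarEnough("veteran", "vetteran"));
        System.out.println(isSimilarEnough("kerry", "rambo"));
    }
}
```

With this variant, a single inserted or deleted letter costs one edit instead of disqualifying the word outright, at the price of quadratic time in the word length.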

Source: https://habr.com/ru/post/1530059/

