Which string distance algorithm is best for measuring input accuracy?

Question


I am trying to write a function that measures how accurately a user types a given word, phrase, or sentence. My goal is to build an application that trains users to type specific phrases accurately.

My initial instinct was to use the basic Levenshtein distance algorithm (mainly because it was the only one I knew off the top of my head).

But after a bit more research, I saw that Jaro-Winkler looks like a slightly more interesting algorithm because it takes transpositions into account.

I even found a question that discusses the differences between these algorithms:

Difference between Jaro-Winkler and Levenshtein distance?

Having read all of that, in addition to the relevant Wikipedia articles, I am still a bit unsure which algorithm fits my purpose.

+5

algorithm nlp levenshtein-distance jaro-winkler

adrianmc Jan 11 '17 at 21:34


3 answers

Slightly tongue-in-cheek, but only slightly: create a generative model for typing that gives a high (prior) probability to hitting the right key and assigns some probability to hitting two adjacent keys at the same time, two keys from different hands in the wrong order, two keys from the same hand in the wrong order, a key next to the right one, a key far from the right one, and so on. Or, perhaps less ad hoc: have your model give the probability of each keystroke given the current pair of keys needed to continue the passage. You could do a lot with such a model; for example, you could get a "distance"-like metric by reporting the probability score of the student's actual performance. But it would be even better to give them a report that summarizes which mistakes they make most often. After all, why boil their performance down to a single number when many numbers will do? Bonus points if you learn the probabilities of the various errors from a large corpus of real typists' work.
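To make the idea concrete, here is a deliberately tiny sketch of such a model in Python. Everything in it is an assumption made for illustration: the three error classes, the made-up class probabilities in ERROR_PROBS, the approximate QWERTY coordinates, and the restriction to lowercase letters and equal-length input (so it only models hitting the wrong key, not extra or missing keystrokes).

```python
import math
from collections import Counter

# Approximate QWERTY layout: each lower row is shifted right by a small offset.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {ch: (r, c + off)
           for r, (row, off) in enumerate(ROWS)
           for c, ch in enumerate(row)}

# Made-up class probabilities, for illustration only. A real model would
# estimate these from a corpus of typing data.
ERROR_PROBS = {"correct": 0.94, "adjacent": 0.05, "far": 0.01}


def classify(expected, typed):
    """Classify one keystroke as 'correct', 'adjacent', or 'far'."""
    if expected == typed:
        return "correct"
    a, b = KEY_POS.get(expected), KEY_POS.get(typed)
    # Keys within about one key width (including diagonals) count as adjacent.
    if a is not None and b is not None and math.dist(a, b) <= 1.3:
        return "adjacent"
    return "far"


def score(target, typed):
    """Return (log-probability, Counter of error classes) for equal-length strings."""
    if len(target) != len(typed):
        raise ValueError("this sketch only handles equal-length input")
    classes = [classify(e, t) for e, t in zip(target, typed)]
    log_prob = sum(math.log(ERROR_PROBS[c]) for c in classes)
    return log_prob, Counter(classes)
```

The log-probability plays the role of the "distance" (closer to zero is better), while the Counter is the start of the per-error-type report. Handling insertions, deletions, and swapped keys would need an alignment step first.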

+3

Daniel Wagner Jan 11 '17 at 21:48


I basically agree with dasblinkenlight's answer; however, I would suggest using Damerau-Levenshtein distance instead of plain Levenshtein, that is, including transpositions. Transpositions are common and easy to make when typing, and there is no good reason why they should cost twice as much as the other possible errors (insertion, deletion, and substitution).
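As a rough illustration, the sketch below implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein), which is the simplest way to add adjacent transpositions to the usual dynamic-programming table. The function name osa_distance is my own, and the full, unrestricted Damerau-Levenshtein distance needs a slightly more involved algorithm.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Two adjacent characters swapped: count it as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

With this, osa_distance("the", "teh") is 1, whereas plain Levenshtein distance charges 2 for the same swap.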

+1

fnl Jan 12 '17 at 14:56


dasblinkenlight · Accepted Answer · 2017-01-11T21:48:19+0000

Since you are evaluating typing quality and want to teach the student to make zero mistakes, you should use Levenshtein distance, because it is less forgiving.

In addition, a Levenshtein score is more intuitive and easier to present graphically than a Jaro-Winkler result. You can modify the Levenshtein algorithm to report insertions, deletions, and substitutions separately and show the end user a list of corrections. Jaro-Winkler, on the other hand, gives you a score that is hard to explain to the end user, because the same mistake is penalized differently depending on where in the string it occurs.
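One possible way to get that breakdown is to fill the usual Levenshtein table and then walk back through it, counting which kind of edit was used at each step. The sketch below is only an illustration: the name edit_report and the convention that an "insertion" is an extra character typed by the user and a "deletion" is a character the user left out are my own choices.

```python
def edit_report(target, typed):
    """Return (Levenshtein distance, counts of each edit type) for typed vs. target."""
    n, m = len(target), len(typed)
    # d[i][j] holds the distance between target[:i] and typed[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if target[i - 1] == typed[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner along one optimal path.
    counts = {"insertions": 0, "deletions": 0, "substitutions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and target[i - 1] != typed[j - 1] else 0
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            counts["substitutions"] += cost  # 0 when the characters match
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            counts["deletions"] += 1         # user left out target[i - 1]
            i -= 1
        else:
            counts["insertions"] += 1        # user typed an extra typed[j - 1]
            j -= 1
    return d[n][m], counts
```

When several optimal alignments exist, this walk prefers the diagonal step, so the breakdown is one valid explanation of the mistakes rather than the only one. Recording the positions visited during the walk would give the list of corrections to display.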
