Which string distance algorithm is best for measuring input accuracy?

I am trying to write a function that determines how accurately the user types a specific phrase, sentence, or word. My goal is to create an application that trains the user to type certain phrases accurately.

My initial instinct was to use the basic Levenshtein distance algorithm (mainly because it was the only algorithm I knew off the top of my head).

But after a bit more research, I saw that Jaro-Winkler is a slightly more interesting algorithm because it takes transpositions into account.

I even found a link that talks about the differences between these algorithms:

Difference between Jaro-Winkler and Levenshtein distance?

Having read all this, in addition to the relevant Wikipedia articles, I am still not quite sure which algorithm fits my purpose.

+5
3 answers

Since you are evaluating typing quality and want to teach the student to make zero mistakes, you should use Levenshtein distance, because it is less forgiving.

In addition, the Levenshtein score is more intuitive and easier to present graphically than a Jaro-Winkler result. You can modify the Levenshtein algorithm to report insertions, deletions, and substitutions separately and show the end user a list of corrections. Jaro-Winkler, on the other hand, gives you a score that is hard to explain to the end user, because the penalty for a mistake depends on where in the string it occurs (Jaro-Winkler rewards a matching prefix, so an error near the start costs more than the same error later on).
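The idea of reporting each kind of edit separately can be sketched roughly as follows. This is only an illustration, not code from the answer: the function name `edit_report` and the category labels `missing`, `extra`, and `substituted` are made up for this example. It fills the standard Levenshtein table and then walks back through it to classify each edit.

```python
def edit_report(target: str, typed: str) -> dict:
    """Count the edits separating `typed` from `target`, by type.

    Hypothetical helper for illustration; the labels are assumptions.
    """
    m, n = len(target), len(typed)
    # dist[i][j] = Levenshtein distance between target[:i] and typed[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if target[i - 1] == typed[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # user left out target[i-1]
                dist[i][j - 1] + 1,         # user typed an extra character
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right corner and classify each edit.
    report = {"missing": 0, "extra": 0, "substituted": 0}
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and target[i - 1] == typed[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1             # characters match, no edit
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            report["substituted"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            report["missing"] += 1
            i -= 1
        else:
            report["extra"] += 1
            j -= 1
    report["distance"] = dist[m][n]
    return report


print(edit_report("kitten", "sitting"))
```

For the target "kitten" and the typed text "sitting" this reports two substitutions and one extra character, for a total distance of 3. When several minimal edit scripts exist, the walk-back above simply picks one of them.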

+4

Slightly tongue-in-cheek, but only slightly: build a generative model for typing that gives a high (prior) probability to hitting the right letter and apportions some probability to hitting two adjacent keys at the same time, two keys from different hands in the wrong order, two keys from the same hand in the wrong order, a key next to the right one, a key far from the right one, and so on. Or perhaps less ad hoc: give your model the probability of a keystroke sequence given the current pair of keys needed to continue the passage. You could do a lot with such a model; for example, you could get a "distance" by computing a likelihood score for the student's actual performance. But it would probably be even better to give them a report that summarizes which mistakes they make most often. After all, why boil their performance down to one number when many numbers will do? Bonus points if you learn the probabilities of the various errors from a large corpus of real typists' work.
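A very reduced sketch of this idea follows, under several loud assumptions that are not in the answer: one recorded keystroke per target character (so no insertions or omissions), a letters-only QWERTY layout, only three error categories, and invented prior probabilities in `PRIOR` that a real system would estimate from data. The names `classify` and `score` are hypothetical.

```python
import math
from collections import Counter

# Assumed layout: letter rows of a QWERTY keyboard.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Invented prior probabilities per keystroke category (assumption).
PRIOR = {"correct": 0.94, "adjacent": 0.05, "other": 0.01}


def neighbours(key: str) -> set:
    """Keys directly left or right of `key` on its row (crude adjacency)."""
    for row in ROWS:
        k = row.find(key)
        if k != -1:
            return {row[p] for p in (k - 1, k + 1) if 0 <= p < len(row)}
    return set()  # key not on the assumed layout, e.g. a space


def classify(wanted: str, pressed: str) -> str:
    """Assign one keystroke to a category of the toy model."""
    if pressed == wanted:
        return "correct"
    if pressed in neighbours(wanted):
        return "adjacent"
    return "other"


def score(target: str, typed: str):
    """Return (negative log-likelihood, per-category counts)."""
    if len(target) != len(typed):
        raise ValueError("this sketch assumes one keystroke per target character")
    kinds = Counter(classify(w, p) for w, p in zip(target, typed))
    nll = -sum(count * math.log(PRIOR[kind]) for kind, count in kinds.items())
    return nll, kinds


nll, kinds = score("hello", "hwllo")
print(round(nll, 3), dict(kinds))
```

The negative log-likelihood plays the role of the single "distance" number, while the `Counter` is the beginning of the per-mistake report that the answer argues is more useful.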

+3

I basically agree with dasblinkenlight's answer; however, I would suggest using Damerau-Levenshtein rather than plain Levenshtein, that is, including transpositions. Transpositions are quite common and easy to make when typing, and there is no good reason why they should be penalized with twice the distance of the other possible errors (insertion, deletion, and substitution).
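To make the "double penalty" concrete, here is a minimal sketch of the optimal string alignment distance, the restricted variant of Damerau-Levenshtein that counts a swap of two adjacent characters as one edit. It is not the unrestricted Damerau-Levenshtein algorithm, and the function name `osa_distance` and its `transpositions` flag are made up for this example; turning the flag off yields plain Levenshtein for comparison.

```python
def osa_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    With transpositions=False this reduces to plain Levenshtein distance.
    """
    m, n = len(a), len(b)
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Two adjacent characters swapped: count as a single edit.
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


print(osa_distance("the", "teh"))                        # 1
print(osa_distance("the", "teh", transpositions=False))  # 2
```

Typing "teh" for "the" costs 1 when transpositions are counted and 2 under plain Levenshtein, which is exactly the difference described above.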

+1

Source: https://habr.com/ru/post/1262655/

