I used the adist function in R, which calculates the Levenshtein distance between character strings. Here is a reproducible example:
>a <- c("bonjour", "bonsoir", "good morning", "hello world")
>b <- c("maman", "bienjoue", "printemps")
>adist(a, b, counts = TRUE)
The results I get are as follows:
[,1] [,2] [,3]
[1,] 7 3 8
[2,] 7 5 8
[3,] 10 11 12
[4,] 11 10 11
attr(,"counts")
, , ins
[,1] [,2] [,3]
[1,] 0 1 2
[2,] 0 1 2
[3,] 0 0 1
[4,] 0 1 0
, , del
[,1] [,2] [,3]
[1,] 2 0 0
[2,] 2 0 0
[3,] 7 4 4
[4,] 6 4 2
, , sub
[,1] [,2] [,3]
[1,] 5 2 6
[2,] 5 4 6
[3,] 3 7 7
[4,] 5 5 9
attr(,"trafos")
[,1] [,2] [,3]
[1,] "SSSSSDD" "MSIMMMMS" "SSIMSSSSI"
[2,] "SSSSSDDS" "MSIMSMSS" "SSIMSSSSI"
[3,] "SSDDDMSDDDMDD" "SSSSSDMSSDDDD" "SSSSSIMSSDDDD"
[4,] "SSSSSDDDDDDD" "SIMSSMSSDDDD" "SSSSSSSSSDDD"
In cell [4, 1], the counts show that it performed 6 deletions, 5 substitutions and 0 insertions. However, the "trafos" attribute for this cell contains 5 S and 7 D, a total of 12 changes, while the distance is 11 (there is one extra D).
This cell is the Levenshtein distance between "hello world" and "maman".
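To see whether other cells are affected, the "trafos" strings can be compared against the reported distances. The following is only a minimal sketch: it assumes that every character other than M in a "trafos" string stands for one edit operation, and the variable names (d, trafos, n_ops, dist) are my own.

```r
a <- c("bonjour", "bonsoir", "good morning", "hello world")
b <- c("maman", "bienjoue", "printemps")
d <- adist(a, b, counts = TRUE)

trafos <- attr(d, "trafos")

# Count the edit operations in each transformation string:
# drop the "M" (match) characters and count what is left (I, D, S).
n_ops <- nchar(gsub("M", "", trafos, fixed = TRUE))
dim(n_ops) <- dim(trafos)

# Plain numeric matrix of distances, without the extra attributes.
dist <- matrix(as.vector(d), nrow = nrow(d))

# Row/column indices of cells where the two disagree.
which(n_ops != dist, arr.ind = TRUE)
```

An empty result would mean the "trafos" strings and the distances agree everywhere; any rows listed are cells like [4, 1] above.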
If I apply adist directly to these two strings, rather than to the two vectors, I get the following:
>adist("hello world","maman",counts = TRUE)
[,1]
[1,] 11
attr(,"counts")
, , ins
[,1]
[1,] 0
, , del
[,1]
[1,] 6
, , sub
[,1]
[1,] 5
attr(,"trafos")
[,1]
[1,] "SSSSSDDDDDD"
This seems right in this case.
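The same comparison can be made for every pair at once by recomputing each cell with a scalar call and checking it against the vectorized result. This is a rough sketch for diagnosis only; the helper name one_pair and the variables vec and single are hypothetical.

```r
a <- c("bonjour", "bonsoir", "good morning", "hello world")
b <- c("maman", "bienjoue", "printemps")

# "trafos" from the vectorized call.
vec <- attr(adist(a, b, counts = TRUE), "trafos")

# "trafos" for a single pair a[i], b[j], computed with a scalar call.
one_pair <- function(i, j) {
  attr(adist(a[i], b[j], counts = TRUE), "trafos")[1, 1]
}

# Rebuild the full matrix one pair at a time.
single <- outer(seq_along(a), seq_along(b), Vectorize(one_pair))

# Cells where the vectorized and scalar results differ.
which(vec != single, arr.ind = TRUE)
```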
Is this a bug in "adist", or is there an explanation for the extra D?