How to determine the similarity of characters?

I use the Levenshtein distance to find similar strings after OCR. However, for some candidates the edit distance is the same even though their visual appearance is clearly different.

For example, the string Co will return these matches:

    CY (1)
    CZ (1)
    Ca (1)

Given that Co is the output of the OCR engine, Ca is a more likely match than the others. Therefore, after calculating the Levenshtein distance, I would like to refine the result by sorting it by visual similarity. To calculate this similarity, I would like to use a standard sans-serif font such as Arial.

Is there a library I can use for this purpose, or how could I implement it myself? Alternatively, are there string similarity algorithms more accurate than the Levenshtein distance that I could use in addition?

+2
3 answers

If you are looking for a table that lets you calculate a "substitution cost" based on visual similarity: I searched for such a thing for a while with little success, so I started treating it as a fresh problem. I do not work with OCR, but I am looking for a way to narrow the search space in a probabilistic search for mistyped characters. Since those errors arise because a person visually confused the characters, the same principle should apply to you.

My approach was to classify letters by their stroke components in an 8-bit field. The bits, from left to right, are:

    7: Left Vertical
    6: Center Vertical
    5: Right Vertical
    4: Top Horizontal
    3: Middle Horizontal
    2: Bottom Horizontal
    1: Top-left to bottom-right stroke
    0: Bottom-left to top-right stroke

For lowercase characters, descenders on the left are recorded in bit 1 and descenders on the right in bit 0, the same bits as the diagonals.

With this scheme, I came up with the following values, which attempt to rank the characters by visual similarity.

    m:           11110000: F0
    g:           10111101: BD
    S,B,G,a,e,s: 10111100: BC
    R,p:         10111010: BA
    q:           10111001: B9
    P:           10111000: B8
    Q:           10110110: B6
    D,O,o:       10110100: B4
    n:           10110000: B0
    b,h,d:       10101100: AC
    H:           10101000: A8
    U,u:         10100100: A4
    M,W,w:       10100011: A3
    N:           10100010: A2
    E:           10011100: 9C
    F,f:         10011000: 98
    C,c:         10010100: 94
    r:           10010000: 90
    L:           10000100: 84
    K,k:         10000011: 83
    T:           01010000: 50
    t:           01001000: 48
    J,j:         01000100: 44
    Y:           01000011: 43
    I,l,i:       01000000: 40
    Z,z:         00010101: 15
    A:           00001011: 0B
    y:           00000101: 05
    V,v,X,x:     00000011: 03

Admittedly, this is still too primitive for my purposes and needs more work, but you can use it, or tailor it to suit your needs; the scheme is quite simple. This ranking is for monospaced fonts, so if you use a sans-serif font you will probably have to rework the values.

This is a hybrid table covering all characters, lowercase and uppercase. Splitting it into an uppercase-only and a lowercase-only table may prove more efficient, and would also let you apply a case-specific penalty.

Keep in mind that these are early experiments. If you see a way to improve the scheme (for example, by changing the order of the bits), feel free to do so.
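For illustration, here is a minimal C++ sketch (mine, not part of the original answer) of how the table could drive a substitution cost: XOR the two descriptors and count the differing stroke bits. The map holds only a few rows of the table above; the rest would be filled in the same way.

    #include <bitset>
    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // A few rows of the stroke table above; extend with the remaining rows.
    static const std::unordered_map<char, std::uint8_t> kStrokes = {
        {'o', 0xB4}, {'O', 0xB4}, {'D', 0xB4},
        {'a', 0xBC}, {'e', 0xBC}, {'s', 0xBC},
        {'C', 0x94}, {'c', 0x94},
        {'Y', 0x43}, {'Z', 0x15},
    };

    // Cost in [0.0, 1.0]: the fraction of the 8 stroke bits that differ.
    // Characters missing from the table fall back to the maximum cost.
    double strokeCost(char a, char b) {
        auto ia = kStrokes.find(a), ib = kStrokes.find(b);
        if (ia == kStrokes.end() || ib == kStrokes.end()) return 1.0;
        return std::bitset<8>(ia->second ^ ib->second).count() / 8.0;
    }

    int main() {
        std::cout << strokeCost('o', 'a') << "\n"; // 0.125: one stroke differs
        std::cout << strokeCost('o', 'Y') << "\n"; // 0.875: seven strokes differ
    }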

+3

So your distance function simply needs different costs for substituting different pairs of characters.

That is, instead of a substitution that always adds a fixed cost of one or two, use a substitution-cost function that, depending on the characters involved, returns something between 0.0 and 2.0 as the cost of replacing particular characters in particular contexts.

At each memoization step, simply call this cost function:

    cost[x][y] = min(
        cost[x-1][y] + 1,                             // insert
        cost[x][y-1] + 1,                             // delete
        cost[x-1][y-1] + cost_to_replace(a[x], b[y])  // replace
    );
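For concreteness, here is a self-contained C++ sketch of that recurrence (my own code, not the linked implementation); cost_to_replace here is a toy that only discounts case-only differences, standing in for a real visual-similarity table:

    #include <algorithm>
    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Toy cost in [0.0, 1.0]; replace with a visual-similarity lookup.
    double cost_to_replace(char a, char b) {
        if (a == b) return 0.0;
        if (std::tolower(static_cast<unsigned char>(a)) ==
            std::tolower(static_cast<unsigned char>(b)))
            return 0.5; // same letter, different case: cheap
        return 1.0;     // unrelated characters: full cost
    }

    double weighted_edit_distance(const std::string& a, const std::string& b) {
        std::vector<std::vector<double>> cost(
            a.size() + 1, std::vector<double>(b.size() + 1, 0.0));
        for (std::size_t x = 0; x <= a.size(); ++x) cost[x][0] = x; // deletions
        for (std::size_t y = 0; y <= b.size(); ++y) cost[0][y] = y; // insertions
        for (std::size_t x = 1; x <= a.size(); ++x)
            for (std::size_t y = 1; y <= b.size(); ++y)
                cost[x][y] = std::min({
                    cost[x - 1][y] + 1,  // insert
                    cost[x][y - 1] + 1,  // delete
                    cost[x - 1][y - 1] + cost_to_replace(a[x - 1], b[y - 1])
                });
        return cost[a.size()][b.size()];
    }

    int main() {
        std::cout << weighted_edit_distance("Co", "CO") << "\n"; // 0.5 (case only)
        std::cout << weighted_edit_distance("Co", "Ca") << "\n"; // 1.0
    }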

Here is my full implementation of edit distance; just swap the replace_cost constant for a replace_cost function as shown:

https://codereview.stackexchange.com/questions/10130/edit-distance-between-two-strings

In terms of implementing the cost_to_replace function, you need a character-by-character cost matrix based on how similar characters look. Such a table may already be floating around, or you can build one yourself by rendering each pair of characters to a pair of images and comparing the images for similarity using standard vision techniques.
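As a hedged sketch of that image route: render_glyph below is hypothetical (any rasterizer, such as FreeType, that draws one character into a fixed-size binary bitmap would do; you must plug one in), and similarity is measured as the Jaccard overlap of inked pixels.

    #include <array>
    #include <cstddef>

    constexpr std::size_t W = 32, H = 32;
    using Bitmap = std::array<std::array<bool, W>, H>;

    // Hypothetical helper: rasterize `c` in your chosen font (e.g. Arial)
    // at a fixed size into a W x H binary bitmap. Supply a real renderer.
    Bitmap render_glyph(char c);

    // Jaccard similarity of inked pixels: 1.0 = identical, 0.0 = disjoint.
    double glyph_similarity(char a, char b) {
        Bitmap ga = render_glyph(a), gb = render_glyph(b);
        std::size_t both = 0, either = 0;
        for (std::size_t y = 0; y < H; ++y)
            for (std::size_t x = 0; x < W; ++x) {
                both   += (ga[y][x] && gb[y][x]) ? 1 : 0;
                either += (ga[y][x] || gb[y][x]) ? 1 : 0;
            }
        return either ? static_cast<double>(both) / either : 1.0;
    }

    // A substitution cost could then be derived as, e.g.:
    //   cost_to_replace(a, b) = 2.0 * (1.0 - glyph_similarity(a, b));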

Alternatively, you can use a supervised method: correct a number of OCR misreads and tally which character pairs get confused; those tallies then become the cost table above (i.e., if OCR confuses two characters, they are probably visually similar).
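A minimal sketch of that supervised tally (assuming, for simplicity, that each corrected line is character-aligned with the OCR output; real data would need alignment first):

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>

    // Count how often each (unordered) character pair is confused.
    using Confusions = std::map<std::pair<char, char>, int>;

    void tally(const std::string& ocr, const std::string& truth, Confusions& c) {
        for (std::size_t i = 0; i < std::min(ocr.size(), truth.size()); ++i)
            if (ocr[i] != truth[i])
                ++c[{std::min(ocr[i], truth[i]), std::max(ocr[i], truth[i])}];
    }

    // Map counts onto costs in (0.0, 1.0]: never-confused pairs cost 1.0,
    // frequently confused pairs approach 0.0. max_count is the largest tally.
    double cost_from_counts(const Confusions& c, char a, char b, int max_count) {
        auto it = c.find({std::min(a, b), std::max(a, b)});
        int n = (it == c.end()) ? 0 : it->second;
        return 1.0 - static_cast<double>(n) / (max_count + 1);
    }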

+2

In general, I have seen Damerau-Levenshtein used far more often than plain Levenshtein ; it essentially adds a transposition operation. It is said to account for more than 80% of human spelling mistakes, so you should certainly consider it.
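For reference, a sketch of what the extension looks like in its restricted (optimal string alignment) form: inside the usual DP loop over cost[x][y] (as in the answer above), one extra case allows swapping two adjacent characters.

    // Damerau extension (sketch): besides insert/delete/replace, allow a
    // transposition of two adjacent characters at cost 1. Here a and b are
    // the strings and cost[x][y] is the usual 1-indexed DP table.
    if (x > 1 && y > 1 && a[x - 1] == b[y - 2] && a[x - 2] == b[y - 1])
        cost[x][y] = std::min(cost[x][y], cost[x - 2][y - 2] + 1.0);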

As for your specific problem, you can easily modify the algorithm to increase the cost of substituting an uppercase letter for a lowercase one, and vice versa, to get something like this:

    dist(Co, CY) = 2
    dist(Co, CZ) = 2
    dist(Co, Ca) = 1
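A sketch of such a cost function (mine, not from the answer): plugged into any weighted edit distance, it makes a substitution that crosses case classes cost 2 instead of 1, reproducing the distances above.

    #include <cctype>

    // Crossing case classes (lowercase <-> uppercase) costs 2;
    // staying within one class costs 1; identical characters cost 0.
    double case_aware_cost(char a, char b) {
        if (a == b) return 0.0;
        bool la = std::islower(static_cast<unsigned char>(a)) != 0;
        bool lb = std::islower(static_cast<unsigned char>(b)) != 0;
        return (la != lb) ? 2.0 : 1.0;
    }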
+2

Source: https://habr.com/ru/post/1275234/

