How to determine if a text file is converted via OCR

I want to write an application in C# that checks whether a text file was produced by OCR or typed on a keyboard.

+4
3 answers

This problem is hard to solve in general, but it can be easy to solve for specific cases.

For example, if your OCR software inserts a bunch of non-ASCII characters, and all of your documents contain only uppercase letters A to Z, lowercase letters a to z, digits and punctuation, then your job is pretty simple.

In that case, you can loop over the characters in the document and test each one with if statements such as if (char.IsLetter(currentChar)) and if (char.IsDigit(currentChar)), or call char.GetUnicodeCategory and branch on the result in a switch statement.
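The character loop described above could look roughly like the sketch below. The class name CharScan and the method UnexpectedRatio are made up for illustration, and the "expected" set (ASCII letters, digits, punctuation, whitespace) is an assumption about your documents. What ratio counts as suspicious is something you would have to tune on your own files.

```csharp
using System;

public static class CharScan
{
    // Hypothetical helper: returns the fraction of characters that fall
    // outside the assumed "expected" set (ASCII letters, digits,
    // punctuation and whitespace).
    // Note: ASCII symbols such as '$' or '+' are not punctuation according
    // to char.IsPunctuation, so they are counted as unexpected here.
    public static double UnexpectedRatio(string text)
    {
        if (string.IsNullOrEmpty(text)) return 0.0;

        int unexpected = 0;
        foreach (char currentChar in text)
        {
            bool expected = currentChar < 128 &&
                (char.IsLetter(currentChar) ||
                 char.IsDigit(currentChar) ||
                 char.IsPunctuation(currentChar) ||
                 char.IsWhiteSpace(currentChar));

            if (!expected) unexpected++;
        }
        return (double)unexpected / text.Length;
    }
}
```

A call such as CharScan.UnexpectedRatio(File.ReadAllText(path)) gives one number per file; a file with a noticeably higher ratio than your typed documents would be a candidate for OCR output.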

If certain words or letters always come out wrong, you can create a Dictionary<string, bool> and fill it with words that you know OCR always gets wrong and/or words that you know a person would not get wrong. Then loop over all the words in your document and check each one against the dictionary; a match suggests whether the text came from a person or from OCR.
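A minimal sketch of that dictionary lookup follows. The entries are hypothetical samples, not a real list of OCR errors, and the meaning of the bool value (true for an OCR misreading, false for a typing slip) is an assumption made for this example. The names WordEvidence and Count are also invented.

```csharp
using System;
using System.Collections.Generic;

public static class WordEvidence
{
    // Hypothetical sample entries.
    // true  = misspelling assumed to be typical of OCR
    // false = misspelling assumed to be typical of a human typist
    private static readonly Dictionary<string, bool> KnownErrors =
        new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase)
        {
            { "tbe", true },
            { "wbich", true },
            { "teh", false },
            { "adn", false },
        };

    // Splits the text into words and counts dictionary hits on each side.
    public static (int Ocr, int Typed) Count(string text)
    {
        char[] separators =
            { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' };

        int ocr = 0, typed = 0;
        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (KnownErrors.TryGetValue(word, out bool isOcr))
            {
                if (isOcr) ocr++; else typed++;
            }
        }
        return (ocr, typed);
    }
}
```

Comparing the two counts gives a rough vote rather than proof; a document with no hits at all stays undecided.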

If your OCR software produces output that is not easy to detect this way, you will have to resort to artificial intelligence to solve the problem. I hope you do not have to, because that is really difficult to program and takes a lot of work to configure and maintain properly. From your description and your comments, it seems that you can use one of the easier solutions.

Regardless, any software that does this job will misclassify some documents. A user can type something strange or copy and paste a non-ASCII character (for example, in the word résumé), or the OCR may happen to produce no detectable errors. I hope you have a way to deal with that, or that your situation is not risky enough for it to be a problem.

+2

When I read something, I can usually tell whether it was OCRed by spotting spelling errors that result from replacing the correct characters with look-alikes. For example, 0 and O, 5 and S, 1 and l or I, rn and m, etc. If you write your program to look for these anomalies, you may be able to find OCRed text.
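One way to look for such look-alike substitutions is to count digits that sit between letters inside a word. The sketch below assumes that the digits 0, 1 and 5 standing in for O, l or I, and S are the confusions of interest; the class and method names are invented, and product names or identifiers that mix letters and digits will produce false positives.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookAlikeScan
{
    // One or more of the digits 0, 1, 5 with an ASCII letter on each side,
    // e.g. the "e11o" in "He11o" or the "w0r" in "w0rld".
    private static readonly Regex DigitInsideWord =
        new Regex("[A-Za-z][015]+[A-Za-z]");

    // Hypothetical helper: counts the non-overlapping matches in the text.
    public static int CountSuspects(string text)
    {
        return DigitInsideWord.Matches(text).Count;
    }
}
```

Plain numbers such as years are not counted, because the pattern requires a letter on both sides of the digits.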

Similarly, you can look for other spelling errors that usually indicate typed text. For example, transposed letters (teh) or letters replaced by a neighboring key on the keyboard are likely indicators of typed text.
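A transposition like teh for the can be checked by comparing a token with a known correct word. The helper below is only a sketch: the name IsAdjacentSwap is invented, and where the list of correct words comes from (for instance a spelling dictionary) is left as an assumption.

```csharp
using System;

public static class TypoScan
{
    // Returns true when token equals word except for exactly one pair of
    // adjacent characters that have been swapped (e.g. "teh" vs "the").
    public static bool IsAdjacentSwap(string token, string word)
    {
        if (token.Length != word.Length) return false;

        // Find the first position where the two strings differ.
        int i = 0;
        while (i < token.Length && token[i] == word[i]) i++;

        // Identical strings, or a difference in the last character only,
        // cannot be an adjacent swap.
        if (i >= token.Length - 1) return false;

        // The two characters at i and i + 1 must be exchanged.
        if (token[i] != word[i + 1] || token[i + 1] != word[i]) return false;

        // Everything after the swapped pair must match.
        for (int j = i + 2; j < token.Length; j++)
        {
            if (token[j] != word[j]) return false;
        }
        return true;
    }
}
```

Running this for every unknown token against words of the same length gives a count of likely typing slips, which can be weighed against the OCR-style anomalies.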

+4

OCRed text almost always consists of single-line paragraphs. OCR engines also usually have trouble distinguishing between some uppercase and lowercase letters and between letters with similar glyphs, such as S / s, V / v, X / x, O / o / 0, 1 / l / I, etc.
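The case confusion mentioned above can be probed by counting lowercase letters that are immediately followed by an uppercase letter inside a word. This is a rough sketch with invented names, restricted to ASCII letters; legitimate names such as McDonald or camelCase identifiers will also match, so the count is only a hint.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseScan
{
    // A lowercase ASCII letter directly followed by an uppercase one,
    // e.g. the "oV" in "coVer" or the "oX" in "boX".
    private static readonly Regex LowerThenUpper = new Regex("[a-z][A-Z]");

    // Hypothetical helper: counts such case switches in the text.
    public static int CountCaseSwitches(string text)
    {
        return LowerThenUpper.Matches(text).Count;
    }
}
```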

+1

Source: https://habr.com/ru/post/1345391/

