OCR: how to increase accuracy? Are there existing libraries for removing non-text "furniture" (form borders, etc.) that confuses OCR?

I want to remove rectangles, etc. that enclose the text in a screenshot image, so that I can perform optical character recognition to get the exact text from the screen.

Background:

I am doing this to extract data from a legacy application for use with other applications. This is the only way to get the data, since the application's files are in a closed, proprietary binary format.

I will use AutoItScript to drive the application so that it displays the data in its user interface, then capture the screen and send the image to tesseract.

I have already had some success with the UI automation, and I was able to use tesseract to get plain ASCII text from a bitmap.

There are several AutoItScript forum threads discussing its use with tesseract / OCR, but none that address my specific question. http://www.autoitscript.com/forum/index.php?s=6c32c3ece12756e635a619cdf175eff9&showforum=2

What I need help with:

There are thin, 1-pixel-wide rectangles that closely border the text. When the image is fed to tesseract, it reads the rectangle edges as characters, for example an "I" for a vertical edge of a rectangle.

Any thoughts on how to remove rectangles or best practices?

Tesseract downloads: http://code.google.com/p/tesseract-ocr/downloads/list (English language data: tesseract-2.00.eng.tar.gz, for Tesseract 2.00, 2007, 989 KB)

Stack Overflow


The answer refers to Graphics.DrawRectangle and Pens.Transparent from .NET's System.Drawing; the rest of its text is missing.
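
As a rough illustration of removing thin borders before OCR, here is a minimal Python sketch. It is not taken from any answer here, and it makes several assumptions: the screenshot has already been binarized into a list of rows with 1 for ink and 0 for background, the borders are straight horizontal or vertical runs, and no glyph contains a straight stroke as long as `min_len`. The names `remove_long_lines` and `make_demo` are made up for this example. Loading and saving the image is left out.

```python
def remove_long_lines(img, min_len=20, fg=1, bg=0):
    """Return a copy of img with long straight foreground runs cleared.

    img is a list of rows (lists of ints). Any horizontal or vertical run
    of fg pixels that is at least min_len long is treated as a border
    line and set to bg. Shorter runs (text strokes) are kept.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    to_clear = set()

    # Collect long horizontal runs, scanning the original image.
    for y in range(h):
        x = 0
        while x < w:
            if img[y][x] != fg:
                x += 1
                continue
            start = x
            while x < w and img[y][x] == fg:
                x += 1
            if x - start >= min_len:
                to_clear.update((y, i) for i in range(start, x))

    # Collect long vertical runs, also on the original image, so that
    # corners shared with a horizontal run are still counted.
    for x in range(w):
        y = 0
        while y < h:
            if img[y][x] != fg:
                y += 1
                continue
            start = y
            while y < h and img[y][x] == fg:
                y += 1
            if y - start >= min_len:
                to_clear.update((i, x) for i in range(start, y))

    out = [row[:] for row in img]
    for y, x in to_clear:
        out[y][x] = bg
    return out


def make_demo():
    """Build a small test image: a 1-pixel rectangle around two short strokes."""
    h, w = 12, 30
    img = [[0] * w for _ in range(h)]
    for x in range(2, 28):          # top and bottom edges
        img[2][x] = 1
        img[9][x] = 1
    for y in range(2, 10):          # left and right edges
        img[y][2] = 1
        img[y][27] = 1
    for y in range(4, 8):           # short vertical stroke standing in for text
        img[y][5] = 1
    for x in range(5, 8):           # short horizontal stroke standing in for text
        img[4][x] = 1
    return img


if __name__ == "__main__":
    cleaned = remove_long_lines(make_demo(), min_len=8)
    for row in cleaned:
        print("".join("#" if p else "." for p in row))
```

Where a border actually touches a glyph, the shared pixel is cleared along with the line, so `min_len` should be chosen comfortably larger than the tallest and widest straight stroke in the UI font.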


Source: https://habr.com/ru/post/1737010/

