I want to remove the rectangles and similar borders that enclose text in a screenshot image, so that I can run optical character recognition (OCR) on it and get the exact text from the screen.
Background:
I am doing this to extract data from an outdated application for use with other applications. This is the only way to get this data, since the linked files are in a private, proprietary binary format.
I will use AutoItScript to drive the application so that it displays the data in its user interface, then I will capture the screen and send the image to tesseract.
I have already had some success with the UI automation, and I was able to use tesseract to get plain ASCII text from a bitmap.
There are several AutoItScript forum posts discussing its use with tesseract / OCR, but none that address my specific question.
http://www.autoitscript.com/forum/index.php?s=6c32c3ece12756e635a619cdf175eff9&showforum=2
What I need help with:
There are thin, 1-pixel-wide rectangles that closely surround the text. When the image is passed to tesseract, it reads them as characters; for example, it sees a line of a rectangle as the letter I.
Any thoughts on how to remove rectangles or best practices?
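To make the goal concrete, here is a minimal sketch of one possible approach: erase any straight horizontal or vertical run of dark pixels that is longer than a character stroke could be. This is only an illustration, not a tested solution for the real screenshots. It assumes the image has already been converted to a black-and-white bitmap stored as a list of rows (1 = dark pixel, 0 = background); the function names and the `min_len` threshold are hypothetical, and loading a real .png into this form is left out.

```python
def find_long_runs(bitmap, min_len):
    """Return the (row, col) positions of dark pixels that lie on a
    horizontal or vertical run of at least min_len dark pixels."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    marked = set()

    # Scan each row for long horizontal runs. The extra iteration at
    # c == width acts as a sentinel that closes a run touching the edge.
    for r in range(height):
        start = None
        for c in range(width + 1):
            dark = c < width and bitmap[r][c] == 1
            if dark and start is None:
                start = c
            elif not dark and start is not None:
                if c - start >= min_len:
                    marked.update((r, k) for k in range(start, c))
                start = None

    # Scan each column for long vertical runs in the same way.
    for c in range(width):
        start = None
        for r in range(height + 1):
            dark = r < height and bitmap[r][c] == 1
            if dark and start is None:
                start = r
            elif not dark and start is not None:
                if r - start >= min_len:
                    marked.update((k, c) for k in range(start, r))
                start = None

    return marked


def remove_long_lines(bitmap, min_len):
    """Return a copy of bitmap with every long straight run erased.

    min_len is an assumed threshold: it must be larger than the longest
    straight stroke in the font, or parts of letters will be erased too.
    """
    marked = find_long_runs(bitmap, min_len)
    return [
        [0 if (r, c) in marked else px for c, px in enumerate(row)]
        for r, row in enumerate(bitmap)
    ]


def rows_to_bitmap(rows):
    """Helper for the demo: '#' is a dark pixel, anything else is background."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def bitmap_to_rows(bitmap):
    """Inverse of rows_to_bitmap, for printing results."""
    return ["".join("#" if px else "." for px in row) for row in bitmap]


# A tiny stand-in for a screenshot: an "H" glyph inside a 1-pixel box.
BOXED = [
    "#########",
    "#.......#",
    "#..#.#..#",
    "#..###..#",
    "#..#.#..#",
    "#.......#",
    "#########",
]

if __name__ == "__main__":
    cleaned = remove_long_lines(rows_to_bitmap(BOXED), min_len=5)
    print("\n".join(bitmap_to_rows(cleaned)))
```

In this toy case the box edges are 7 and 9 pixels long while the glyph strokes are at most 3, so a threshold of 5 separates them. On a real screenshot the threshold would have to be chosen from the font size, and a rectangle that touches the text would still leave gaps in the letters where it is cut away.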
Tesseract downloads: http://code.google.com/p/tesseract-ocr/downloads/list
- tesseract-2.00.eng.tar.gz (English language data for Tesseract 2.00, 2007, 989 KB)