Performance issues using OCR Tesseract from a Python application

Question

I recently put together an interface for scanning documents and uploading them as searchable PDFs to KnowledgeTree, our document management system. We have access to many separate tools for different parts of this process, but I wanted to combine them into a single interface to keep things simple for users.

Here is the platform:

#    OS: Ubuntu Desktop 10.04
#    GUI Toolkit: wxPython
#    OCR package: Tesseract 3.00 (compiled executable)

And here is the main process:

#    1. Retrieve individual page images from scanner
#    2. Call Tesseract OCR executable to produce HOCR data for each page
#    3. Run extracted words against English dictionary to guess if page orientation is correct
#        3a. If word matches are below threshold, rotate page 90 degrees and try again
#    4. Detect document type and retrieve metadata from HOCR data
#    5. Merge scanned pages and HOCR data into a finished PDF
#    6. Upload PDF and attached metadata to document management system through KnowledgeTree API
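
Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the function names, the 0.5 threshold, and the `ocr_at_angle` callback (which is expected to rotate the page, run OCR, and return the hOCR text) are all hypothetical. It also assumes the `hocr` config file is installed alongside the Tesseract language data, and it extracts words by simply stripping markup rather than parsing the hOCR structure.

```python
import re
import subprocess


def build_tesseract_command(image_path, output_base):
    """Command line asking the tesseract executable for hOCR output.

    Assumption: a config file named "hocr" is available to Tesseract.
    """
    return ["tesseract", image_path, output_base, "hocr"]


def run_tesseract(image_path, output_base):
    """Run the OCR executable and block until it finishes (step 2)."""
    subprocess.check_call(build_tesseract_command(image_path, output_base))


def words_from_hocr(hocr_text):
    """Strip markup and return the lowercase alphabetic tokens."""
    text = re.sub(r"<[^>]+>", " ", hocr_text)
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


def match_ratio(words, dictionary):
    """Fraction of words found in the dictionary (0.0 for an empty page)."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / float(len(words))


def find_orientation(ocr_at_angle, dictionary, threshold=0.5):
    """Try 0, 90, 180 and 270 degrees (step 3 / 3a).

    Returns (angle, hocr) for the first rotation whose match ratio reaches
    the threshold, or for the best-scoring rotation if none does.
    """
    best = (0, "", -1.0)
    for angle in (0, 90, 180, 270):
        hocr = ocr_at_angle(angle)
        ratio = match_ratio(words_from_hocr(hocr), dictionary)
        if ratio >= threshold:
            return angle, hocr
        if ratio > best[2]:
            best = (angle, hocr, ratio)
    return best[0], best[1]
```

Passing the OCR step in as a callback keeps the orientation logic separate from the call to the external executable, so each OCR run (the expensive part) is easy to count, time, or replace when profiling.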

, , 2 . , , , . , 4 , . , , , ABBYY OCR 50 , ( , ABBYY ). , , 1-3.

, -, , OCR/upload , - OCR , , Python. , , ?

, , , 4, .

python ocr tesseract

robots.jpg 21 . '11 21:26

Neskie · Accepted Answer · 2011-01-27T07:03:26+0000

, , , Tesseract OCR, . - tesseract , 3.0, , .

, , .

, 1.5, , , .

OCRfeeder, .
