I recently put together an interface to scan and download searchable documents in KnowledgeTree, our document management system. We already have separate tools for different parts of this process, but I wanted to combine them into a single interface to keep things simple for users.
Here is the platform:
And here is the main process:
1. Retrieve individual page images from the scanner
2. Call the Tesseract OCR executable to produce hOCR data for each page
3. Run the extracted words against an English dictionary to guess whether the page orientation is correct
   3a. If word matches are below a threshold, rotate the page 90 degrees and try again
4. Detect the document type and retrieve metadata from the hOCR data
5. Merge the scanned pages and hOCR data into a finished PDF
6. Upload the PDF and attached metadata to the document management system through the KnowledgeTree API
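
To make steps 3 and 3a more concrete, here is a minimal Python sketch of the dictionary-based orientation check. It is not the code from my interface, just an illustration under a few assumptions: the words are pulled from `ocrx_word` spans in the hOCR output, the threshold of 0.5 is a placeholder, and the names `extract_words`, `match_ratio`, and `find_orientation` are hypothetical. The `ocr_at_angle` argument stands in for whatever rotates the page image and runs Tesseract on it.

```python
import re
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Optional, Set

WORD_RE = re.compile(r"[A-Za-z]+")


class _HocrWordParser(HTMLParser):
    """Collects the text of every <span class="ocrx_word"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[str] = []
        self._depth = 0          # > 0 while inside an ocrx_word span
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1     # nested markup inside a word, e.g. <strong>
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if "ocrx_word" in classes:
                self._depth = 1
                self._buf = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buf).strip()
                if text:
                    self.words.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_words(hocr: str) -> List[str]:
    """Return the recognized words from an hOCR document, in order."""
    parser = _HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def match_ratio(words: Iterable[str], dictionary: Set[str]) -> float:
    """Fraction of alphabetic tokens that appear in the dictionary."""
    tokens = [t.lower() for w in words for t in WORD_RE.findall(w)]
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in dictionary) / len(tokens)


def find_orientation(
    ocr_at_angle: Callable[[int], Iterable[str]],
    dictionary: Set[str],
    threshold: float = 0.5,      # assumed value; tune on real scans
) -> Optional[int]:
    """Try 0, 90, 180, 270 degrees and return the first angle whose
    dictionary match ratio reaches the threshold, or None if none does."""
    for angle in (0, 90, 180, 270):
        if match_ratio(ocr_at_angle(angle), dictionary) >= threshold:
            return angle
    return None
```

In practice `ocr_at_angle` would rotate the page image, call the Tesseract executable with its `hocr` output config, and pass the resulting file contents through `extract_words`. Keeping the OCR call behind a callable means the threshold logic can be checked without a scanner or Tesseract installed.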
, , 2 . , , , . , 4 , . , , , ABBYY OCR 50 , ( , ABBYY ). , , 1-3.
, -, , OCR/upload , - OCR , , Python. , , ?
, , , 4, .