At iText, we have also looked into OCR, and it is possible (using Tesseract).
The workflow:
- extract all images from the PDF using iText
- run Tesseract on each image to extract the text (and coordinates, font, etc.)
- apply a coordinate transformation (since the Tesseract coordinate system and the iText coordinate system do not match)
- add a layer to the PDF (canvas.beginLayer)
- draw all the text in this layer at the correct position.
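The coordinate transformation step can be sketched as follows. This is a minimal illustration, not iText's own code. It assumes that Tesseract reports word boxes in image pixels with the origin at the top-left and y pointing down, that PDF user space has its origin at the bottom-left with y pointing up, and that the image is placed on the page axis-aligned, without rotation or skew. The class and method names (OcrCoordinates, PdfBox, toPdfSpace) are hypothetical.

```java
/** Hypothetical helper: maps an OCR bounding box from image pixels to PDF points. */
final class OcrCoordinates {

    /** Axis-aligned box in PDF user space: lower-left corner plus size, in points. */
    static final class PdfBox {
        final double x, y, width, height;

        PdfBox(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * @param left,top,right,bottom  OCR box in pixels, origin top-left, y down (assumed)
     * @param imagePixelWidth,imagePixelHeight  size of the image that was sent to OCR
     * @param placedX,placedY  lower-left corner of the image on the page, in points
     * @param placedWidth,placedHeight  size of the image on the page, in points
     */
    static PdfBox toPdfSpace(int left, int top, int right, int bottom,
                             int imagePixelWidth, int imagePixelHeight,
                             double placedX, double placedY,
                             double placedWidth, double placedHeight) {
        // Scale factors from pixels to points along each axis.
        double sx = placedWidth / imagePixelWidth;
        double sy = placedHeight / imagePixelHeight;

        // x only needs scaling and an offset.
        double x = placedX + left * sx;
        // y is flipped: the bottom edge of the pixel box becomes the lower edge in PDF space.
        double y = placedY + (imagePixelHeight - bottom) * sy;

        return new PdfBox(x, y, (right - left) * sx, (bottom - top) * sy);
    }
}
```

For example, a word at pixels (100, 200)-(300, 260) in a 1000 x 2000 pixel image placed at (10, 20) with size 500 x 1000 points would map to a box with lower-left corner (60, 890), 100 points wide and 30 points high. A rotated or skewed image would need the full transformation matrix instead of two scale factors.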
There are many more refinements you could make. A short list of suggestions:
- correct baseline
- correct font
- correct spelling errors
- estimate the text color
- estimate the background color
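As an illustration of the two color items, here is one simple heuristic; it is an assumption for the sake of example, not something iText or Tesseract provides. It treats the most frequent pixel value inside a word's bounding box as the background and averages the pixels that differ strongly from it to get the text color. The names OcrColors, estimateBackground and estimateText, and the minDistance threshold, are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: guesses background and text color from the pixels of one word box. */
final class OcrColors {

    /** Returns the most frequent RGB value (0xRRGGBB); assumed to be the background. */
    static int estimateBackground(int[] pixels) {
        if (pixels.length == 0) {
            throw new IllegalArgumentException("no pixels");
        }
        Map<Integer, Integer> counts = new HashMap<>();
        int best = pixels[0] & 0xFFFFFF;
        int bestCount = 0;
        for (int p : pixels) {
            int rgb = p & 0xFFFFFF; // drop the alpha channel, if any
            int count = counts.merge(rgb, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = rgb;
            }
        }
        return best;
    }

    /**
     * Averages all pixels whose distance from the background is at least minDistance.
     * Falls back to the background color when no pixel qualifies.
     */
    static int estimateText(int[] pixels, int background, int minDistance) {
        long r = 0, g = 0, b = 0;
        int n = 0;
        for (int p : pixels) {
            if (distance(p, background) >= minDistance) {
                r += (p >> 16) & 0xFF;
                g += (p >> 8) & 0xFF;
                b += p & 0xFF;
                n++;
            }
        }
        if (n == 0) {
            return background & 0xFFFFFF;
        }
        return (int) (((r / n) << 16) | ((g / n) << 8) | (b / n));
    }

    /** Sum of absolute per-channel differences (0 to 765). */
    private static int distance(int a, int b) {
        return Math.abs(((a >> 16) & 0xFF) - ((b >> 16) & 0xFF))
             + Math.abs(((a >> 8) & 0xFF) - ((b >> 8) & 0xFF))
             + Math.abs((a & 0xFF) - (b & 0xFF));
    }
}
```

The pixel array could come from java.awt.image.BufferedImage.getRGB over the word's bounding box. Anti-aliased or noisy scans would probably need something more robust, such as clustering the colors first.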
This is not an easy task, but it is certainly possible.