Get text position with tesseract 2.04 and Java

Question


I am running OCR with Tesseract 2.04 on some images, and now I need the exact position of the recognized text. This version does not return that information.

I need this to create a searchable PDF file. I already know how to write text into the bottom layer of a PDF, but I need the position at which to stamp that text. My first idea is to run OCR on the PDF, get the text and its position, and write it into the PDF using the iText API.

+4

java pdf ocr itext tesseract

Raduan santos Dec 05 '11 at 19:00

1 answer

Joris Schellekens · Accepted Answer · 2017-07-18T09:53:58+0000

At iText, we have also looked into OCR, and it is possible (using Tesseract).

The workflow:

extract all images from the PDF using iText
extract the text (and coordinates, font, etc.) with Tesseract
apply coordinate transformations (since the Tesseract coordinate system and the iText coordinate system do not match)
add a layer to the PDF (canvas.beginLayer)
draw all the text in this layer at the correct position
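
The coordinate transformation step can be sketched as follows. This is only an illustration under stated assumptions, not iText's or Tesseract's own code: it assumes the OCR engine reports each bounding box as (left, top, right, bottom) in image pixels with the origin at the top-left corner and y growing downward, and that the image is drawn unrotated on the page in a rectangle given in PDF points (origin at the bottom-left, y growing upward). The class and method names (OcrToPdfCoords, toPdf, PdfRect) are hypothetical. If your Tesseract output already uses a bottom-left origin, the vertical flip is not needed.

```java
final class OcrToPdfCoords {

    /** Axis-aligned rectangle in PDF user space (points, origin bottom-left). */
    static final class PdfRect {
        final double x, y, width, height;

        PdfRect(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Maps a pixel bounding box (top-left origin, y down) to PDF user space.
     * pageX/pageY is the lower-left corner of the image on the page and
     * pageWidth/pageHeight is the size at which the image is drawn, in points.
     */
    static PdfRect toPdf(int left, int top, int right, int bottom,
                         int imagePxWidth, int imagePxHeight,
                         double pageX, double pageY,
                         double pageWidth, double pageHeight) {
        // Points per pixel in each direction.
        double sx = pageWidth / imagePxWidth;
        double sy = pageHeight / imagePxHeight;

        // Horizontal axis: same direction, only scaled and shifted.
        double x = pageX + left * sx;
        // Vertical axis: flipped, so the box's bottom edge becomes the lower y.
        double y = pageY + (imagePxHeight - bottom) * sy;

        return new PdfRect(x, y, (right - left) * sx, (bottom - top) * sy);
    }

    public static void main(String[] args) {
        // A 1000x2000 px scan drawn at 500x1000 pt in the page's lower-left corner.
        PdfRect r = toPdf(100, 200, 300, 260, 1000, 2000, 0, 0, 500, 1000);
        System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
    }
}
```

The resulting x and y give the lower-left corner of the word's box, which is a reasonable first guess for where to start drawing the text in the layer.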

There are many more optimizations you could make. A short list of suggestions:

correct the baseline
correct the font
correct spelling errors
estimate the text color
estimate the background color
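
As one example of the font-related tuning, the font size can be chosen so that the drawn text spans the width of the OCR bounding box. The sketch below is an assumption about how one might do it, not something taken from iText: it relies on text width scaling linearly with font size, and it takes the width measurement as a caller-supplied function (the hypothetical widthAtOnePoint) so that any font library can be plugged in.

```java
import java.util.function.ToDoubleFunction;

final class FitFontSize {

    /**
     * Returns the font size (in points) at which the text is exactly boxWidth wide.
     * widthAtOnePoint must return the width of a string set at font size 1.
     */
    static double fitFontSize(String text, double boxWidth,
                              ToDoubleFunction<String> widthAtOnePoint) {
        double unitWidth = widthAtOnePoint.applyAsDouble(text);
        if (unitWidth <= 0) {
            throw new IllegalArgumentException("text has no measurable width");
        }
        // Width grows linearly with size, so a simple ratio is enough.
        return boxWidth / unitWidth;
    }

    public static void main(String[] args) {
        // Stand-in metric: every character is 0.5 units wide at size 1.
        double size = fitFontSize("hello", 50.0, s -> 0.5 * s.length());
        System.out.println(size);
    }
}
```

In a real pipeline the stand-in metric would be replaced by the width reported by the font you actually draw with.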

This is not an easy task, but it is certainly possible.
