Get text position with Tesseract 2.04 and Java

I am running OCR with Tesseract 2.04 on some images, and now I need to get the exact position of the recognized text. However, this version does not return that information.

I need this to create a searchable PDF file. I already know how to write text into the bottom layer of the PDF, but I need a position at which to stamp that text. My first idea is to run OCR on the PDF, get the text and its position, and write it back into the PDF using the iText API.

1 answer

At iText, we have also looked into OCR, and it is possible (using Tesseract).

The workflow:

  • extract all images from the PDF using iText
  • extract the text (and coordinates, font, etc.) with Tesseract
  • apply coordinate transformations (the Tesseract coordinate system and the iText coordinate system do not match)
  • add a layer to the PDF (canvas.beginLayer)
  • draw all the text in this layer at the correct position
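
The coordinate transformation step can be sketched as follows. This is only an illustration, not code from iText or Tesseract: the class OcrCoords and the method toPdfRect are hypothetical names. It assumes the OCR boxes are given in image pixels with the origin at the top-left and y growing downward (as hOCR output does), that the image is placed on the page unrotated, and that you already know the image's lower-left corner and size in PDF points. PDF user space has its origin at the bottom-left with y growing upward, so the box is scaled and flipped vertically. If your boxes already use a bottom-left origin, the flip is not needed.

```java
/** Hypothetical helper: maps an OCR pixel box to a rectangle in PDF user space. */
final class OcrCoords {

    private OcrCoords() {}

    /**
     * Converts a pixel box (origin top-left, y down) into PDF points
     * (origin bottom-left, y up).
     *
     * left, top, right, bottom  : OCR box in image pixels
     * imgWidthPx, imgHeightPx   : image size in pixels
     * imgX, imgY                : lower-left corner of the image on the page, in points
     * imgWidthPt, imgHeightPt   : size of the image on the page, in points
     *
     * Returns {llx, lly, urx, ury} in PDF points.
     * Assumes the image is neither rotated nor skewed on the page.
     */
    static double[] toPdfRect(int left, int top, int right, int bottom,
                              int imgWidthPx, int imgHeightPx,
                              double imgX, double imgY,
                              double imgWidthPt, double imgHeightPt) {
        // Points per pixel along each axis.
        double scaleX = imgWidthPt / imgWidthPx;
        double scaleY = imgHeightPt / imgHeightPx;

        // x only needs scaling and an offset.
        double llx = imgX + left * scaleX;
        double urx = imgX + right * scaleX;

        // y is flipped: the bottom edge of the pixel box becomes the lower y in PDF space.
        double lly = imgY + (imgHeightPx - bottom) * scaleY;
        double ury = imgY + (imgHeightPx - top) * scaleY;

        return new double[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // A 1000x2000 px image placed at (50, 100) and drawn 500x1000 pt large.
        double[] r = toPdfRect(100, 200, 300, 260, 1000, 2000, 50, 100, 500, 1000);
        System.out.println(java.util.Arrays.toString(r)); // [100.0, 970.0, 200.0, 1000.0]
    }
}
```

The lower-left corner of the resulting rectangle is a reasonable first guess for where to start drawing the text in the layer. The baseline and font size corrections mentioned below would refine it.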

There are many more optimizations you could make. A short list of suggestions:

  • correct baseline
  • correct font
  • correct spelling errors
  • estimate the text color
  • estimate the background color

This is not an easy task, but it is certainly possible.


Source: https://habr.com/ru/post/1384757/

