Using C# to extract text from an OCR'd (searchable) PDF

I need to extract text from a PDF that has already been run through OCR. Can I use a regular PDFReader to get the text, or do I need to run the PDF through OCR again?

2 answers

It depends on how it was converted. Many OCR applications place the text underneath the image in some way. Some do this by putting the text down first and then placing the image on top. Others put the image down first and then place the text on top, using the "Do Not Mark" text rendering mode, which leaves the text invisible.

I mention this because I cannot predict how any given text extraction tool will respond to invisible text. In theory, it should just give you the text (this is what Acrobat does). Whether that really happens in every text extraction tool is anyone's guess.
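To tell which of the two cases a given file falls into, one rough check is to look for the text rendering mode operator in a page's content stream. In PDF, `3 Tr` selects rendering mode 3 (neither fill nor stroke), which is how invisible OCR text is usually written. The sketch below is only a heuristic. It assumes you have already obtained the decompressed content stream as a string by some other means, and the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvisibleTextCheck
{
    // Matches the operator sequence "3 Tr": text rendering mode 3,
    // defined by the PDF specification as "neither fill nor stroke".
    // The lookbehind rejects operands such as "13" or "0.3".
    private static readonly Regex InvisibleMode =
        new Regex(@"(?<![\d.+-])3\s+Tr\b", RegexOptions.Compiled);

    // Heuristic only: expects an already-decompressed content stream and
    // does not skip string literals, so a literal containing "3 Tr"
    // would also be reported.
    public static bool UsesInvisibleText(string decodedContentStream)
    {
        if (decodedContentStream == null)
            throw new ArgumentNullException(nameof(decodedContentStream));

        return InvisibleMode.IsMatch(decodedContentStream);
    }
}
```

For example, `InvisibleTextCheck.UsesInvisibleText("BT /F1 12 Tf 3 Tr 72 700 Td (Hello) Tj ET")` returns true, while the same stream with `0 Tr` returns false. A false result does not prove the text is visible: it may still be hidden under the image, which is the first case described above.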


There are a number of commercial SDKs for processing PDF files. Here is Foxit's: http://www.foxitsoftware.com/pdf/sdk/activex/


Source: https://habr.com/ru/post/1340044/

