Using C# to extract text from an OCR'd (searchable) PDF

Question

I need to extract text from a PDF that has already been run through OCR. Can I use a regular PDF reader library to get the text, or do I need to run OCR on the PDF again?

+4

pdf ocr

enamrik Feb 16 '11 at 17:08

2 answers

There are a number of commercial SDKs for processing PDF files. Here is Foxit's: http://www.foxitsoftware.com/pdf/sdk/activex/

0

VoronoiPotato Feb 16 '11 at 17:11

plinth · Accepted Answer · 2011-02-16T20:08:39+0000

It depends on how it was converted. Many OCR applications put the recognized text in the same place as the page image in one way or another. Some do this by drawing the text first and then laying the image on top of it. Others draw the image first and then put the text on top, using the invisible ("no mark") text rendering mode.
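To make the second case concrete: in a page's content stream, the text rendering mode is set with the `Tr` operator, and mode 3 means the text is neither filled nor stroked, so it is invisible. The following is a minimal sketch, not a real extractor. It assumes you already have a decompressed content stream as a string, and it only handles simple literal strings shown with `Tj`. The class and method names are hypothetical, made up for illustration, and are not part of any PDF library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any PDF library.
public static class OcrTextLayerSketch
{
    // Matches the text rendering mode operator with a single-digit operand, e.g. "3 Tr".
    private static readonly Regex RenderMode =
        new Regex(@"(?<![\w.])(\d)\s+Tr\b");

    // Matches a simple literal string shown with Tj, e.g. "(Hello) Tj".
    // Assumption: no escaped or nested parentheses, no hex strings, no TJ arrays.
    private static readonly Regex ShowText =
        new Regex(@"\(([^()\\]*)\)\s*Tj\b");

    // True if the stream switches to rendering mode 3 (invisible text) anywhere.
    public static bool UsesInvisibleText(string contentStream)
    {
        foreach (Match m in RenderMode.Matches(contentStream))
        {
            if (m.Groups[1].Value == "3")
            {
                return true;
            }
        }
        return false;
    }

    // Collects the raw contents of simple "(...) Tj" operands, in stream order.
    // These are the bytes as written; a real tool must still apply the font encoding.
    public static List<string> LiteralStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match m in ShowText.Matches(contentStream))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

For a stream such as `BT /F1 12 Tf 3 Tr 72 700 Td (Invoice 2011) Tj ET`, `UsesInvisibleText` reports that mode 3 is in use and `LiteralStrings` returns the single operand `Invoice 2011`. Real files usually compress their streams and use font-specific encodings, so in practice you would rely on a PDF library for the extraction itself and use a check like this only for diagnosis.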

I mention this because I cannot predict how any given text extraction tool will react to invisible text. In theory, it should just give you the text (this is what Acrobat does). Whether that is actually the case in every text extraction tool is anyone's guess.
