Can I access a text overlay from a searchable PDF?

Question


I understand that there is a difference between a PDF file and a searchable PDF file. Searchable PDF files have a text overlay that is used for searching. Is it possible to extract this text into a text file? Perhaps using the Adobe API?

+4

pdf ocr

bheussler Oct 4 '12 at 16:00


1 answer

Kurt Pfeifle · Accepted Answer · 2012-10-04T23:43:51+0000

“Searchable PDF” is not an official term, but it is a commonly used expression.

If a standard PDF embeds all the fonts that it uses, and if these fonts do not use a custom encoding, it is most likely “searchable”: this means that you can copy and paste text from it and you can extract text from it (and tools like pdftotext work more or less flawlessly). This has nothing to do with a text overlay; it is the standard PDF architecture.

What you describe as a “text overlay” is something that can be added to a scanned PDF file. PDF files created by scanning consist of full-page images, usually TIFFs, that are embedded in (otherwise empty) PDF pages. Then, in a further step, a “text overlay” is added by running OCR (optical character recognition) against the images. This makes an otherwise opaque “pixel” PDF searchable.

If such a “text overlay” PDF file does not use strange constructs for its fonts, then it is easy to extract the text into a *.txt file. After all, the whole point of running OCR on an image-only PDF file is to add searchable text:

Install pdftotext (available for Linux, Unix, Windows, Mac OS X), and then try running:
```
pdftotext -layout some-input.pdf some-input.txt
```
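Before extracting, it can help to check whether a PDF has any extractable text at all. The following is only a minimal sketch, assuming pdftotext is on your PATH; the function name `has_text_layer` is made up for illustration. It relies on pdftotext writing to standard output when the output file is given as `-`.

```shell
# has_text_layer FILE (hypothetical helper name)
# Succeeds (exit status 0) if pdftotext extracts at least one
# non-whitespace character from FILE, and fails otherwise.
has_text_layer() {
    # "-" as the output file makes pdftotext write to stdout;
    # grep -q only reports whether any non-whitespace character exists.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Possible usage:
#   if has_text_layer some-input.pdf; then
#       pdftotext -layout some-input.pdf some-input.txt
#   else
#       echo "no text found; the file probably needs OCR first" >&2
#   fi
```

A failing check on a scanned file suggests it is image-only and has no text overlay yet.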

Caution: most OCR is far from perfect. If you get a recognition rate of 99% for all characters, you are lucky. (But this means that about 10% of all words and about 100% of all sentences contain an error, something that would earn you a guaranteed failing grade in high school ...)
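To get a feel for these numbers, here is a rough back-of-the-envelope sketch. It assumes that every character is misrecognized independently with the same probability, which real OCR does not strictly obey, and the function name `unit_error_rate` is made up for illustration. Under that assumption, the chance that a unit of n characters contains at least one error is 1 - p^n, where p is the per-character accuracy.

```shell
# unit_error_rate ACCURACY LENGTH (hypothetical helper name)
# Prints the percentage chance that a run of LENGTH characters contains
# at least one misrecognized character, assuming independent errors
# with per-character accuracy ACCURACY (for example 0.99).
unit_error_rate() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (1 - p ^ n) * 100 }'
}

# Possible usage:
#   unit_error_rate 0.99 10     # a 10-character word
#   unit_error_rate 0.99 100    # a 100-character sentence
```

With 99% accuracy this gives a bit under 10% for a 10-character word and roughly 63% for a 100-character sentence, so the figures quoted above are best read as rough rules of thumb; longer sentences push the rate closer to 100%.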

It should also be noted that these “text overlays” are technically identical to any other text section in a PDF file (except that they contain more spelling and grammar errors :-). The only difference is that they use a special text rendering mode (mode 3), which is described as "Neither fill nor stroke text (invisible)." Although the text is “invisible,” you can still select, copy, or extract it.
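In a page's content stream, the rendering mode is set with the `Tr` operator, so invisible text shows up as `3 Tr`. If you are curious whether a given file uses it, the following sketch may help. It assumes qpdf is installed and uses its QDF mode to write an uncompressed copy of the file to standard output; the function name `count_invisible_text` is made up for illustration, and the count is only a heuristic, not a full PDF parse.

```shell
# count_invisible_text FILE (hypothetical helper name)
# Prints the number of lines in the uncompressed PDF that set text
# rendering mode 3 ("3 Tr"). Assumes qpdf is available.
count_invisible_text() {
    # --qdf uncompresses and normalizes content streams; "-" sends the
    # result to stdout. grep -a treats the binary output as text.
    qpdf --qdf --object-streams=disable "$1" - 2>/dev/null |
        grep -a -c -E '(^|[[:space:]])3[[:space:]]+Tr([[:space:]]|$)'
}

# Possible usage:
#   count_invisible_text some-input.pdf
```

A count greater than zero suggests that the file carries an OCR-style invisible text layer.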
