How to extract text from a PDF document?

Question

How to extract text from a PDF using PHP?

(I cannot install other tools because I do not have root access.)

I found some functions that work for plain text, but they do not handle Unicode characters well:

http://www.hashbangcode.com/blog/zend-lucene-and-pdf-documents-part-2-pdf-data-extraction-437.html

+43

php text unicode pdf

Sfisioza Aug 09 '11 at 16:55

2 answers

I know this thread is quite old, but the need is still there. After reading a lot of documentation, forum posts, and existing scripts, I wrote a new, more advanced script that supports both compressed and uncompressed PDFs:

https://gist.github.com/smalot/6183152
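To illustrate what "compressed and uncompressed" means here, below is a minimal sketch and not the code from the gist. It assumes the PDF uses FlateDecode (zlib) streams, that the zlib extension is loaded, and that text is drawn with simple literal strings followed by the Tj operator. The function names pdfStreams, pdfShownText and pdfTextSketch are hypothetical, and real documents also need TJ arrays, hex strings, font encodings, and other filters handled.

```php
<?php
// Sketch only: FlateDecode streams and "(literal) Tj" text, nothing else.
// All function names below are made up for this illustration.

/**
 * Return the body of every "stream ... endstream" section, inflating the
 * ones that are zlib-compressed and leaving the others untouched.
 * Scanning for the markers is fragile; real parsers use the /Length entry.
 */
function pdfStreams(string $pdf): array
{
    preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $matches);
    $streams = [];
    foreach ($matches[1] as $raw) {
        // gzuncompress() reads zlib data, which is what FlateDecode stores.
        // The @ hides the warning it raises for streams that are not compressed.
        $inflated = @gzuncompress($raw);
        $streams[] = ($inflated !== false) ? $inflated : $raw;
    }
    return $streams;
}

/**
 * Collect the literal strings shown with the Tj operator in one content stream.
 */
function pdfShownText(string $content): string
{
    // Matches "(some text) Tj", allowing backslash-escaped characters inside.
    preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)\s*Tj/s', $content, $matches);
    $parts = [];
    foreach ($matches[1] as $literal) {
        // Undo only the \( \) and \\ escapes; octal escapes are left as-is.
        $parts[] = preg_replace('/\\\\([()\\\\])/', '$1', $literal);
    }
    return implode(' ', $parts);
}

/**
 * Run both steps over a whole PDF held in memory, one line per stream with text.
 */
function pdfTextSketch(string $pdf): string
{
    $texts = [];
    foreach (pdfStreams($pdf) as $content) {
        $text = pdfShownText($content);
        if ($text !== '') {
            $texts[] = $text;
        }
    }
    return implode("\n", $texts);
}

// Demo on a hand-built fragment, so no real PDF file is needed.
$content = 'BT /F1 12 Tf (Hello PDF) Tj ET';
$pdf = "1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
     . gzcompress($content)
     . "\nendstream\nendobj\n";
echo pdfTextSketch($pdf), "\n";
```

For a real file you would pass file_get_contents('filename.pdf') to pdfTextSketch(). This sketch does nothing about Unicode: strings that depend on a font's own encoding come out as raw bytes, which is the main reason to use a complete script or library instead.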

Hope this helps everyone.

+9

Sebastien Malot Aug 08 '13 at 9:39

Pedro Lobito · Accepted Answer · 2011-08-09 18:53

Download class.pdf2text.php from https://pastebin.com/dvwySU1a (updated on April 5, 2014) or from http://www.phpclasses.org/browse/file/31030.html (registration required).

The code:

    include('class.pdf2text.php');
    $a = new PDF2Text();
    $a->setFilename('filename.pdf');
    $a->decodePDF();
    echo $a->output();

The class did not work with every PDF I tested, so try it and see; you might be lucky :)

If the above does not work, try http://pdfparser.org/
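As a rough sketch of that fallback, assuming the site refers to the smalot/pdfparser Composer package and that it was installed with "composer require smalot/pdfparser" in the project directory (Composer normally runs without root access). The wrapper name extractWithPdfParser and the autoload path are assumptions for this example; Parser, parseFile() and getText() are the calls from the library's documented basic usage.

```php
<?php
// Assumption: vendor/autoload.php was created by "composer require smalot/pdfparser".
// extractWithPdfParser() is a hypothetical helper, not part of the library.

function extractWithPdfParser(string $pdfPath, string $autoload): ?string
{
    // Return null instead of failing hard when the library or the file is missing.
    if (!is_file($autoload) || !is_file($pdfPath)) {
        return null;
    }
    require_once $autoload;

    try {
        $parser = new \Smalot\PdfParser\Parser();
        $pdf = $parser->parseFile($pdfPath); // parses the whole document
        return $pdf->getText();              // text of all pages as one string
    } catch (\Exception $e) {
        // Encrypted or damaged files can make the parser throw.
        return null;
    }
}

$text = extractWithPdfParser('filename.pdf', __DIR__ . '/vendor/autoload.php');
echo $text ?? "Could not load the library or read the PDF.\n";
```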
