How to convert PDF to text so that I can parse this text using PHP?

Question

I have PDF files that are basically just formatted text, and I want to parse that text using PHP. I understand that a PDF file is binary, so I need a utility or library to convert it to text.

Any recommendations?

+6

linux import php pdf

T. Brian Jones Jun 23 '11 at 9:00

3 answers

Third-party software can extract the text content of a PDF file, for example:

xdoc2txt (Windows only, used in WinMerge plugins)
pdftotext, part of xpdf
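
As a rough illustration of the pdftotext option, the sketch below wraps the command in a small shell function. The function name pdf_to_text and the file name report.pdf are placeholders, not part of xpdf. It assumes pdftotext is already on the PATH. From PHP, the same command line could be run with shell_exec() after quoting the path with escapeshellarg().

```shell
#!/bin/sh
# Hypothetical helper: print the text of a PDF file to stdout using pdftotext.
pdf_to_text() {
    pdf_file=$1

    # Fail early if pdftotext is not installed or not on the PATH.
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found in PATH" >&2
        return 127
    fi

    # Fail early if the input file does not exist or is not readable.
    if [ ! -r "$pdf_file" ]; then
        echo "cannot read: $pdf_file" >&2
        return 1
    fi

    # -layout keeps the physical page layout as closely as possible,
    # -enc UTF-8 sets the output text encoding,
    # and the trailing "-" sends the text to stdout instead of a .txt file.
    pdftotext -layout -enc UTF-8 "$pdf_file" -
}

# Usage (report.pdf is a placeholder name):
#   pdf_to_text report.pdf > report.txt
```

Writing to stdout keeps the helper easy to call from another program, which can then capture the text directly instead of reading a temporary file.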

+4

Benoit Jun 23 '11 at 9:32

You cannot do this with file_get_contents() alone, because a PDF file stores its content in a binary format rather than as plain text. To read or modify a PDF file, you can use a third-party library. Take a look at:

And don't forget

http://php.net/manual/en/book.pdf.php

+1

technology Jun 23 '11 at 9:15

T. Brian Jones · Accepted Answer · 2012-11-06T05:38:58+0000

I ended up using XPDF (which includes pdftotext). This works great, and I use it in production to extract text from millions of PDF files uploaded to our servers.

The following is the installation process on CentOS Linux:

download version 3.03 from here: http://foolabs.com/xpdf/download.html
tar -zxvf xpdfbin-linux-3.03.tar.gz (extract the tar.gz)
create the necessary directories for installation (some or all of them may already exist)
- sudo mkdir /usr/local/man/
- sudo mkdir /usr/local/man/man1/
- sudo mkdir /usr/local/man/man5/
- sudo mkdir /usr/local/etc/xpdfrc/
move files from the extracted folder (cd into the folder where xpdf was just unpacked)
- move all executables from the bin64 directory (xpdf, pdftotext ... all files) to /usr/local/bin/
- move the sample-xpdfrc file to /usr/local/etc/xpdfrc (this can be used as is)
- move the manual pages from the doc directory (*.1 to /usr/local/man/man1/ and *.5 to /usr/local/man/man5/)
xpdf should now be installed and ready to use
you can delete the downloaded tar.gz file and the folder in which it was unpacked.
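
The manual steps above could be collected into a small script along these lines. This is only a sketch: install_xpdf is a hypothetical helper name, the optional second argument (an install prefix, defaulting to /usr/local) is an addition for trying the script out without root, and the files are copied rather than moved so the unpacked folder stays intact until it is deleted by hand.

```shell
#!/bin/sh
# Sketch of the manual install steps; install_xpdf is a hypothetical helper name.
# Usage: install_xpdf <unpacked-dir> [prefix]   (prefix defaults to /usr/local)
install_xpdf() {
    src=$1
    prefix=${2:-/usr/local}

    # Create the target directories; -p succeeds even if they already exist.
    # As in the steps above, etc/xpdfrc is created as a directory.
    mkdir -p "$prefix/bin" "$prefix/man/man1" "$prefix/man/man5" \
             "$prefix/etc/xpdfrc" || return 1

    # Executables from the bin64 directory (xpdf, pdftotext, ...).
    cp "$src"/bin64/* "$prefix/bin/" || return 1

    # Sample configuration file, used as is.
    cp "$src/sample-xpdfrc" "$prefix/etc/xpdfrc/" || return 1

    # Manual pages: section 1 and section 5.
    cp "$src"/doc/*.1 "$prefix/man/man1/" || return 1
    cp "$src"/doc/*.5 "$prefix/man/man5/" || return 1
}

# Example, run as root after unpacking the archive:
#   install_xpdf xpdfbin-linux-3.03
```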
