I ended up using XPDF (which includes pdftotext). This works great, and I use it in production to extract text from millions of PDF files uploaded to our servers.
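As a rough illustration of that kind of bulk extraction, here is a minimal sketch that converts every PDF in one directory to a text file in another. The helper name `extract_dir` and the `PDFTOTEXT` override variable are made up for this example. The only thing assumed about pdftotext itself is its basic `pdftotext input.pdf output.txt` form.

```shell
# extract_dir is a hypothetical helper, not part of xpdf.
# Usage: extract_dir <input-dir> <output-dir>
# PDFTOTEXT may be set to use a binary other than the one on PATH.
extract_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1

    for pdf in "$in_dir"/*.pdf; do
        # If the glob matched nothing, the pattern is left as is; skip it.
        [ -e "$pdf" ] || continue

        name=$(basename "$pdf" .pdf)

        # Basic invocation: pdftotext <PDF-file> <text-file>.
        # A failed file is reported and the loop moves on to the next one.
        "${PDFTOTEXT:-pdftotext}" "$pdf" "$out_dir/$name.txt" \
            || echo "failed: $pdf" >&2
    done
}
```

For example, `extract_dir ./uploads ./text` would write `./text/report.txt` for `./uploads/report.pdf` (both directory names are placeholders).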
The following is the installation process for Linux CentOS:
- download version 3.03 from here: http://foolabs.com/xpdf/download.html
- tar -zxvf xpdfbin-linux-3.03.tar.gz (extract tar.gz)
- create the necessary directories for installation (some or all of them may already exist)
- sudo mkdir /usr/local/man/
- sudo mkdir /usr/local/man/man1/
- sudo mkdir /usr/local/man/man5/
- sudo mkdir /usr/local/etc/xpdfrc/
- move the files out of the extracted folder (cd to the folder where xpdf was just unpacked)
- move all executables from the bin64 directory (xpdf, pdftotext ... all files) to /usr/local/bin/
- move the sample-xpdfrc file to /usr/local/etc/xpdfrc (this can be used as is)
- move the manual pages from the doc directory (*.1 to /usr/local/man/man1/ and *.5 to /usr/local/man/man5/)
- xpdf should now be installed and ready to use
- you can delete the downloaded tar.gz file and the folder in which it was unpacked.
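
The directory creation and file moves above can be sketched as one shell function. This is only an illustration of the listed steps, not an official installer: the name `install_xpdf`, the optional prefix argument, and the `SUDO` variable are assumptions made for this example. It uses `mkdir -p` so that directories that already exist are not an error.

```shell
# install_xpdf is a hypothetical helper that mirrors the manual steps.
# Usage: install_xpdf <unpacked-dir> [prefix]
# The prefix defaults to /usr/local. Set SUDO=sudo when the prefix
# is not writable by the current user.
install_xpdf() {
    src=$1
    prefix=${2:-/usr/local}
    sudo_cmd=${SUDO:-}

    # Create the target directories; -p tolerates ones that already exist.
    $sudo_cmd mkdir -p "$prefix/bin" "$prefix/man/man1" \
        "$prefix/man/man5" "$prefix/etc/xpdfrc" || return 1

    # Executables (xpdf, pdftotext, ...) from the 64-bit directory.
    $sudo_cmd mv "$src"/bin64/* "$prefix/bin/" || return 1

    # Sample configuration file, used as is.
    $sudo_cmd mv "$src/sample-xpdfrc" "$prefix/etc/xpdfrc/" || return 1

    # Manual pages: *.1 go to man1, *.5 go to man5.
    $sudo_cmd mv "$src"/doc/*.1 "$prefix/man/man1/" || return 1
    $sudo_cmd mv "$src"/doc/*.5 "$prefix/man/man5/" || return 1
}
```

After extracting the archive, it could be called as `SUDO=sudo install_xpdf xpdfbin-linux-3.03`, assuming the archive unpacks into a folder of that name.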