How to convert PDF to text so that I can parse this text using PHP?

I have PDF files that are basically just formatted text, and I want to parse that text using PHP. I understand that a PDF file is binary, so I need a utility or library to convert it to text.

Any recommendations?

3 answers

I ended up using XPDF (which includes pdftotext). This works great, and I use it in production to extract text from millions of PDF files uploaded to our servers.

The following is the installation process for Linux CentOS:

  • download version 3.03 from here: http://foolabs.com/xpdf/download.html
  • extract the archive: tar -zxvf xpdfbin-linux-3.03.tar.gz
  • create the necessary directories for installation (some or all of them may already exist)
    • sudo mkdir /usr/local/man/
    • sudo mkdir /usr/local/man/man1/
    • sudo mkdir /usr/local/man/man5/
    • sudo mkdir /usr/local/etc/xpdfrc/
  • move files from the extracted folders (cd to the folder where xpdf was just unpacked)
    • move all executables from the bin64 directory (xpdf, pdftotext ... all files) to /usr/local/bin/
    • move the sample-xpdfrc file to /usr/local/etc/xpdfrc (this can be used as is)
    • move the manual pages from the doc directory (*.1 to /usr/local/man/man1/ and *.5 to /usr/local/man/man5/)
  • xpdf should now be installed and ready to use
  • you can delete the downloaded tar.gz file and the folder in which it was unpacked.
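
The manual steps above can be sketched as one shell function. This is only an illustration, not part of the original answer: the name install_xpdf is hypothetical, the function copies files rather than moving them, and it assumes the unpacked folder contains bin64/, doc/ and sample-xpdfrc as described in the list.

```shell
#!/bin/sh
# Sketch of the manual install steps listed above.
# install_xpdf is a hypothetical helper name, not something shipped with xpdf.
#   $1 = folder where xpdfbin-linux-3.03.tar.gz was unpacked
#   $2 = install prefix (assumed default: /usr/local)
install_xpdf() {
    src_dir=$1
    prefix=${2:-/usr/local}

    # Create the target directories; -p does nothing if they already exist.
    mkdir -p "$prefix/bin" "$prefix/man/man1" "$prefix/man/man5" \
             "$prefix/etc/xpdfrc" || return 1

    # Executables (xpdf, pdftotext, ...) from the 64-bit build.
    cp "$src_dir"/bin64/* "$prefix/bin/" || return 1

    # Sample configuration file, used as is.
    cp "$src_dir/sample-xpdfrc" "$prefix/etc/xpdfrc/" || return 1

    # Manual pages: section 1 and section 5.
    cp "$src_dir"/doc/*.1 "$prefix/man/man1/" || return 1
    cp "$src_dir"/doc/*.5 "$prefix/man/man5/" || return 1
}
```

To try it without touching the system, pass a scratch prefix, for example install_xpdf ./xpdfbin-linux-3.03 "$HOME/xpdf-test". Writing to /usr/local normally requires root, which is why the original steps use sudo.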

Third-party software can extract the text content of a PDF file, for example:

  • xdoc2txt (Windows only, used in WinMerge plugins)
  • pdftotext, part of xpdf
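
As a rough sketch of how pdftotext might be called, the wrapper below converts one PDF to a text file. The name pdf_to_text and the PDFTOTEXT override variable are hypothetical and exist only for this illustration; -layout and -enc are pdftotext options, but whether they suit your documents is an assumption to check against your own files.

```shell
#!/bin/sh
# pdf_to_text is a hypothetical wrapper name used for illustration.
#   $1 = input PDF
#   $2 = output text file (default: input name with .pdf replaced by .txt)
# PDFTOTEXT may point at a specific binary; otherwise the one on PATH is used.
pdf_to_text() {
    pdf=$1
    txt=${2:-${pdf%.pdf}.txt}

    if [ ! -r "$pdf" ]; then
        echo "cannot read: $pdf" >&2
        return 1
    fi

    # -layout tries to preserve the physical layout of the page,
    # -enc UTF-8 selects the encoding of the output text.
    "${PDFTOTEXT:-pdftotext}" -layout -enc UTF-8 "$pdf" "$txt"
}
```

From PHP, the same command line would typically be run with something like shell_exec() or exec(), with the file names quoted via escapeshellarg(), and the resulting .txt file can then be read with file_get_contents() and parsed as ordinary text.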

You cannot do this with file_get_contents() alone, because PDF is a binary format in which the text is usually stored in compressed streams rather than as plain text. To read or modify a PDF file, you can use a third-party library. Take a look at:

And don't forget


Source: https://habr.com/ru/post/891192/

