PHP, document reader library

I need a library to extract text from documents (doc, docx, pdf, html, rtf, odt, ...). Is there a single library that handles all of these document types?

+4
4 answers

On systems other than Windows, there is no such library, and it is very likely that there never will be one. The main reason is that the document formats you listed are revised all the time.

On Windows, however, if you have PHP installed, you can use the ActiveX (COM) extensions to read all of these formats with ease. You only need to install the appropriate office application on the machine, separately from PHP, to get this working. It also means that future versions of these document formats will keep working with your PHP code, as long as your office applications can read them. Look for Win32 PHP libraries in the PHP library repositories and you should find a suitable one there.
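As a rough illustration of the COM route, here is a minimal sketch. It assumes Windows, PHP's com_dotnet extension, and an installed copy of Microsoft Word; the function name `word_text_via_com` is hypothetical, and the calls follow Word's automation object model (`Documents.Open`, `Content.Text`).

```php
<?php
// Hypothetical helper: asks an installed Microsoft Word (via COM) for a document's text.
// Returns null when COM is unavailable, the file is missing, or Word reports an error.
function word_text_via_com(string $path): ?string
{
    // The COM class only exists on Windows with the com_dotnet extension enabled.
    if (!class_exists('COM')) {
        return null;
    }

    // Word expects an absolute path.
    $absolute = realpath($path);
    if ($absolute === false) {
        return null;
    }

    try {
        $word = new COM('Word.Application');
        $word->Visible = false;

        $document = $word->Documents->Open($absolute);
        $text = (string) $document->Content->Text;

        $document->Close(false); // false: do not save changes
        $word->Quit();

        return $text;
    } catch (Throwable $e) {
        return null;
    }
}
```

Any format that the installed Word version can open should work the same way, which is the point this answer makes about future document versions.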

0

Batch convert files to one format using

odtphp http://www.odtphp.com/index.php?i=tutorials&p=tutorial1

or

PyODConverter (call it from PHP with a command-line exec call to make it "work" with PHP) http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html

Then run the converted result through any common pdf2txt or phpOCR library.

+2

A safer bet would be to first convert your documents to plain text and then analyze the text version to do whatever you want. There are many command-line converters that convert from different formats to plain text (Word to txt, PDF to txt, etc.) on any operating system.

By the way, regarding PDF files: not all of them actually contain text. Some are just a collection of scanned images, in which case you are out of luck unless you run OCR on them.
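The convert-to-plain-text approach could be sketched in PHP roughly as follows. The tool choices (pdftotext, antiword, unrtf, odt2txt) are assumptions, not something this answer names; each has to be installed separately and be on the PATH. The function names `converter_command` and `extract_text` are hypothetical.

```php
<?php
// Hypothetical helper: maps a file extension to an external converter command.
// Returns null when no converter is configured for the extension.
function converter_command(string $path): ?string
{
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $argument = escapeshellarg($path); // guard against shell injection

    switch ($extension) {
        case 'pdf':
            return "pdftotext -layout $argument -"; // "-" sends the text to stdout
        case 'doc':
            return "antiword $argument";
        case 'rtf':
            return "unrtf --text $argument";
        case 'odt':
            return "odt2txt $argument";
        default:
            return null;
    }
}

// Returns the plain text of a document, or null if it could not be converted.
function extract_text(string $path): ?string
{
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    // HTML needs no external tool. Note that strip_tags() keeps the
    // contents of <script> and <style> elements.
    if ($extension === 'html' || $extension === 'htm') {
        $html = @file_get_contents($path);
        return $html === false ? null : strip_tags($html);
    }

    $command = converter_command($path);
    if ($command === null) {
        return null;
    }

    $output = shell_exec($command);
    return is_string($output) ? $output : null;
}
```

For a scanned, image-only PDF the converter will typically return empty or near-empty text, so an empty result is a reasonable signal that the file needs OCR instead.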

+2

OpenTBS is a PHP tool that can read and modify the contents of any OpenDocument file (ODT, ODS, ODG, ODF, ODM, ODP, OTT, OTS, OTG, OTP), as well as OpenXML files (DOCX, XLSX, PPTX).

If you can convert files in an unsupported format to one of the formats supported by OpenTBS, then you are done.
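Both format families are ZIP archives that hold the document body as XML, which is why one tool can cover them. The sketch below does not use OpenTBS; it is a plain-PHP illustration that assumes the zip extension (`ZipArchive`) is enabled, and the function names are hypothetical. It handles only DOCX and ODT and ignores headers, footers, and footnotes.

```php
<?php
// Hypothetical helper: converts the body XML of a DOCX or ODT file to plain text.
function xml_to_text(string $xml): string
{
    // End of a paragraph or heading becomes a newline before the tags are removed.
    $xml = preg_replace('~</(w:p|text:p|text:h)>~', "\n", $xml);

    $text = strip_tags($xml);
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8'));
}

// Hypothetical helper: reads the main XML entry out of the archive.
// Returns null for unsupported extensions or unreadable files.
function read_zipped_document(string $path): ?string
{
    $entries = [
        'docx' => 'word/document.xml', // OpenXML main document part
        'odt'  => 'content.xml',       // OpenDocument body
    ];

    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!isset($entries[$extension])) {
        return null;
    }

    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }

    $xml = $zip->getFromName($entries[$extension]);
    $zip->close();

    return $xml === false ? null : xml_to_text($xml);
}
```

Called as `read_zipped_document('report.docx')`, this returns the paragraph text separated by newlines, or null if the file cannot be opened.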

+1

Source: https://habr.com/ru/post/1335440/

