You can use a combination of shell utilities, such as pdftotext for PDF files, wvWare for DOC, and docx2txt.pl for DOCX, as the textractor Ruby gem does.
There are also native PHP classes for extracting text from PDF and DOCX files.
Another Ruby gem, docsplit, even does OCR for you, through Tesseract.
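To illustrate the shell-utility approach, here is a minimal Python sketch that picks a converter by file extension and shells out to it. The helper names (`build_command`, `extract_text`) are made up for this example, and it assumes `pdftotext`, `wvText` (part of wvWare), and `docx2txt.pl` are installed on your PATH and accept an input file followed by an output file; check that against your own installation.

```python
import subprocess
from pathlib import Path

# Assumption: each tool is on PATH and is called as "<tool> <input> <output>".
EXTRACTORS = {
    ".pdf": "pdftotext",
    ".doc": "wvText",        # shipped with wvWare
    ".docx": "docx2txt.pl",
}


def build_command(src, dst):
    """Return the argv list for converting src to plain text at dst."""
    ext = Path(src).suffix.lower()
    try:
        tool = EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"no extractor configured for {ext!r}")
    return [tool, str(src), str(dst)]


def extract_text(src):
    """Run the matching converter and return the extracted text."""
    dst = Path(src).with_suffix(".txt")
    subprocess.run(build_command(src, dst), check=True)
    return dst.read_text(errors="replace")
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to add another format or swap a tool without touching the rest.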
You might also consider Solr for indexing and searching. With the Solr Cell plugin you can index and search Word, PDF, and other documents; I use it successfully in one of my projects. Solr Cell builds on several projects, such as Apache POI, Tika, and PDFBox.
The tricky part is configuring all the jars that Solr Cell depends on and the Solr schema, and working out the indexing request parameters, but you can figure all of this out from the wiki documentation. Here are my jars and schema to get you started; the relevant part of the schema is the line containing "attachment".
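As a rough idea of what an indexing request can look like, here is a Python sketch that posts a file to Solr Cell's `/update/extract` handler. It is not my exact setup: the base URL, the document id, and the function names are assumptions for illustration, and `fmap.content=attachment` assumes your schema has a field named "attachment" that should receive the extracted text.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def extract_url(base, doc_id, content_field="attachment"):
    """Build the Solr Cell request URL for one document."""
    params = {
        "literal.id": doc_id,           # value for the unique key field
        "fmap.content": content_field,  # map extracted text to this field
        "commit": "true",               # commit right away (fine for testing)
    }
    return f"{base.rstrip('/')}/update/extract?{urlencode(params)}"


def index_file(base, doc_id, path):
    """POST the raw file to Solr and return the HTTP status code."""
    with open(path, "rb") as fh:
        req = Request(
            extract_url(base, doc_id),
            data=fh.read(),
            headers={"Content-Type": "application/octet-stream"},
        )
    with urlopen(req) as resp:
        return resp.status
```

Usage would be something like `index_file("http://localhost:8983/solr", "report-1", "report.pdf")`, with the base URL adjusted to wherever your Solr instance or core lives.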
Solr Cell does not perform OCR. You must first run scanned documents through an OCR engine to make them searchable.
For OCR, you can use the open-source engine Tesseract, which has been developed with Google's sponsorship, or you can look at the commercial engine Abbyy. Both are used as command-line utilities, which you can run from your PHP scripts. To get results from Tesseract comparable to Abbyy's, you will need to do pre- and post-processing [1]. There are also cloud services that may be an easier option, for example Wisetrend and Abbyy Cloud. The latter is in beta at the moment, so it is free, and it has ready-to-use PHP code examples.
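Running Tesseract from a script and tidying its output could look roughly like this in Python. The `tesseract <image> <output-base> -l <lang>` form is the standard command line, which writes `<output-base>.txt`; the function names are invented for this sketch, and the cleanup step is only a toy example of post-processing, not the full pre- and post-processing needed to match a commercial engine.

```python
import re
import subprocess
from pathlib import Path


def tesseract_command(image, out_base, lang="eng"):
    """Return argv for Tesseract; it writes its result to <out_base>.txt."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def clean_ocr_text(text):
    """Toy post-processing: rejoin hyphenated words, collapse blanks."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # "exam-\nple" -> "example"
    text = re.sub(r"[ \t]+", " ", text)           # runs of spaces/tabs -> one
    return text.strip()


def run_ocr(image, out_base, lang="eng"):
    """Run Tesseract (assumed to be on PATH) and return cleaned text."""
    subprocess.run(tesseract_command(image, out_base, lang), check=True)
    raw = Path(f"{out_base}.txt").read_text(errors="replace")
    return clean_ocr_text(raw)
```

The same pattern carries over to PHP with `exec()` or `shell_exec()`; the cleaned text is what you would then hand to Solr.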
clyfe