How to extract / recognize text from documents?

Question

I need to extract text from uploaded documents to make them searchable. Documents can be MS Word or pdf (scanned or containing text). This application runs on the LAMP stack, but installing other software may be an option. Is there any tool, service, library, or a combination of those that you could recommend for this task?

+4

php ms-word pdf ocr lamp

Maarten Dec 22 '11 at 19:54

3 answers

As far as I know, little can be done with OCR in PHP itself. The best solution would be to use a cloud service: a web API that lets you upload an image and sends back the OCR data. Try www.ocrsdk.com, the cloud-based OCR SDK recently launched by ABBYY. It is currently in beta, so it is completely free to use, and it has ready-to-use PHP code examples. Disclaimer: I work at ABBYY.

+3

Nikolay Dec 23 '11 at 8:26

I do not know of any software that converts PDF to text, but for the MS Word part you can use Apache POI: http://poi.apache.org/. It is written in Java, so you will need to execute it from your PHP code to make it work.

Another option is JODConverter (which I am currently using for this purpose): http://code.google.com/p/jodconverter/. So if Apache POI does not work for you, I know that JODConverter does. I am using 3.0 beta.

In my PHP code, I save the upload to a file in the TMP directory and run the binary against it. That creates a new file in the TMP directory, and I pull the plain text from the new file.
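The temp-file workflow described above can be sketched as follows. This is a minimal illustration in Python rather than the answerer's PHP code; `convert_via_temp`, `make_command`, and `copy_command` are hypothetical names, and the real converter invocation (JODConverter, Apache POI, or anything else) would have to be supplied by the caller.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def convert_via_temp(upload: bytes, suffix: str, make_command) -> str:
    """Save an upload to a temp dir, run a converter, return the text it wrote.

    make_command(src, dst) is a hypothetical hook that returns the argv list
    for whichever converter is installed; it is not part of any real tool.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"upload{suffix}"
        dst = Path(tmp) / "extracted.txt"
        src.write_bytes(upload)
        # check=True raises CalledProcessError if the converter exits non-zero.
        subprocess.run(make_command(str(src), str(dst)), check=True, timeout=120)
        return dst.read_text(encoding="utf-8", errors="replace")


def copy_command(src: str, dst: str) -> list:
    """Stand-in converter that only copies the file, so the sketch runs anywhere."""
    script = "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])"
    return [sys.executable, "-c", script, src, dst]


if __name__ == "__main__":
    print(convert_via_temp(b"hello from the upload", ".doc", copy_command))
```

Passing the command as an argv list (rather than a shell string) avoids quoting problems with user-supplied file names, and the temporary directory is removed automatically once the text has been read.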

0

Francis lewis Dec 22 '11 at 20:01

clyfe · Accepted Answer · 2011-12-22T20:01:53+0000

You can use a combination of shell utilities, such as pdftotext for PDF files, wvWare for DOC, and docx2txt.pl for DOCX, as the textractor rubygem does.

    # on Ubuntu
    apt-get install wv xpdf-utils links
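One way to wire those utilities together is a small dispatch table keyed on file extension. The sketch below is only an illustration: `EXTRACTORS`, `command_for`, and `extract_text` are hypothetical names, `wvText` is assumed to be the text-output script shipped with the wv package, and the "input file, then output file" argument order is an assumption that should be checked against each tool's man page.

```python
import subprocess
from pathlib import Path

# Extension -> command-line tool. The argument order used below
# (input path, then output path) is an assumption; verify it locally.
EXTRACTORS = {
    ".pdf": ["pdftotext"],
    ".doc": ["wvText"],
    ".docx": ["docx2txt.pl"],
}


def command_for(path: str, out_path: str) -> list:
    """Build the argv list for the extractor that matches the file extension."""
    ext = Path(path).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"no extractor configured for {ext!r}")
    return EXTRACTORS[ext] + [path, out_path]


def extract_text(path: str, out_path: str) -> str:
    """Run the matching extractor and return the text file it produced."""
    subprocess.run(command_for(path, out_path), check=True, timeout=120)
    return Path(out_path).read_text(encoding="utf-8", errors="replace")
```

The same pattern carries over to PHP by building the command with `escapeshellarg()` and running it with `exec()`.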

There are also native PHP classes for extracting text from PDF and DOCX files.

Another rubygem, docsplit, even does OCR for you, through Tesseract.

It may also be worth considering Solr for indexing and searching. You can use the Solr Cell plugin to index and search Word, PDF, and other documents. I use it successfully in one of my projects. Solr Cell is based on several projects, such as Apache POI, Tika, and PDFBox.

The tricky part is setting up all the jars that Solr Cell depends on and the Solr schema, as well as working out the parameters for the indexing request, but you can figure all of this out from the wiki documentation. Here are my jars and my schema to get you started; the relevant part of the schema is the line containing "attachment".

Solr Cell does not perform OCR. Scanned documents must first be run through an OCR engine to make them searchable.
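As a rough idea of what an indexing request can look like, the sketch below posts a file's raw bytes to Solr Cell. It assumes the extracting handler is mounted at `/update/extract` (the path used in the Solr wiki's example configuration) and that the schema's unique key field is called `id`; both depend on your own solrconfig.xml and schema. The function names and the base URL are hypothetical.

```python
import urllib.parse
import urllib.request


def extract_url(solr_base: str, doc_id: str, commit: bool = True) -> str:
    """Build the Solr Cell request URL; literal.id sets the document's id field."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def index_document(solr_base: str, doc_id: str, path: str, content_type: str) -> int:
    """POST the raw file to Solr Cell and return the HTTP status code."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        extract_url(solr_base, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status


if __name__ == "__main__":
    # Assumed local Solr instance; adjust the base URL to your installation.
    print(extract_url("http://localhost:8983/solr", "doc1"))
```

Committing on every request is convenient for a first test but slow for bulk uploads, where a single commit at the end is the usual choice.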

For OCR, you can use the open-source engine Tesseract, which is developed by Google, or you can look at the commercial engine from Abbyy. Both are available as command-line utilities that you can run from your PHP scripts. To get results from Tesseract comparable to Abbyy's, you will need to do pre- and post-processing. There are also cloud services, which may be an easier option, for example Wisetrend and Abbyy Cloud. The latter is in beta at the moment, so it is free, and it has ready-to-use PHP code examples.
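Running Tesseract as a command-line utility can be sketched like this. The helper names are hypothetical, the `eng` language code assumes the English language data is installed, and none of the pre- or post-processing mentioned above is attempted here.

```python
import subprocess
import tempfile
from pathlib import Path


def tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> list:
    """Build the argv list; tesseract itself appends ".txt" to output_base."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> str:
    """Run tesseract on one image and return the recognized text."""
    with tempfile.TemporaryDirectory() as tmp:
        base = str(Path(tmp) / "ocr")
        subprocess.run(tesseract_command(image_path, base, lang), check=True, timeout=300)
        return Path(base + ".txt").read_text(encoding="utf-8", errors="replace")
```

A scanned PDF would first have to be split into page images before being passed to `ocr_image`; which image formats are accepted depends on the installed Tesseract version.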
