How to index PDF, PPT, and XLS files in Lucene (Java, Python, or PHP; any of them is fine)?

Question

How can I index PDF, PPT, and XLS files in Lucene? A solution in Java, Python, or PHP would be fine.

I would also like to know how to add metadata while indexing, so that I can boost certain fields.

+3

java indexing lucene

harsha Apr 6 '10 at 6:03

4 answers

There are several frameworks for extracting text, suitable for Lucene indexing, from rich-format files (pdf, ppt, etc.).

One of them is Apache Tika, a subproject of Lucene.
Apache POI is another Apache project; it reads Microsoft Office file formats.

+4

Yuval F 06 . '10 7:56

You can use Apache Tika. Tika supports the following formats, among others:

XML
Microsoft Office
OpenDocument
Rich Text
Java class files and archives
mbox

The code will look like this:

Reader reader = new Tika().parse(stream);
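The one-liner above can be expanded into a minimal end-to-end sketch that also covers the metadata part of the question. It assumes a recent Lucene (8 or later) and Apache Tika on the classpath. The class name TikaLuceneSketch, the field names path, title, and body, and the boost factor are illustrative choices, not anything taken from the answers here. Index-time field boosts were removed in Lucene 7, so this sketch boosts the metadata field at query time instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

// Hypothetical helper class; all names here are assumptions for illustration.
public class TikaLuceneSketch {

    /** Extracts text and metadata with Tika, then adds one Lucene document. */
    public static void indexFile(IndexWriter writer, Path file)
            throws IOException, TikaException {
        Metadata metadata = new Metadata();
        String body;
        try (InputStream stream = Files.newInputStream(file)) {
            // Tika detects the format (PDF, PPT, XLS, ...) and returns plain text.
            // By default it caps the returned length; see Tika.setMaxStringLength.
            body = new Tika().parseToString(stream, metadata);
        }
        String title = metadata.get(TikaCoreProperties.TITLE); // may be null
        writer.addDocument(buildDocument(file.toString(), title, body));
    }

    /** Builds a document with the extracted text plus metadata fields. */
    static Document buildDocument(String path, String title, String body) {
        Document doc = new Document();
        // StringField is indexed as a single token: good for identifiers.
        doc.add(new StringField("path", path, Field.Store.YES));
        if (title != null) {
            doc.add(new TextField("title", title, Field.Store.YES));
        }
        // The body is analyzed and searchable, but not stored in the index.
        doc.add(new TextField("body", body, Field.Store.NO));
        return doc;
    }

    /** Matches a term in title or body; title matches are weighted by titleBoost. */
    static Query titleBoostedQuery(String term, float titleBoost) {
        // The term should already be in analyzed form (e.g. lowercase
        // when the index was written with StandardAnalyzer).
        Query inTitle = new BoostQuery(new TermQuery(new Term("title", term)), titleBoost);
        Query inBody = new TermQuery(new Term("body", term));
        return new BooleanQuery.Builder()
                .add(inTitle, BooleanClause.Occur.SHOULD)
                .add(inBody, BooleanClause.Occur.SHOULD)
                .build();
    }
}
```

With an IndexWriter opened on any Directory, calling indexFile once per file builds the index; titleBoostedQuery("lucene", 5f) can then be passed to an IndexSearcher so that documents matching in the title rank above those matching only in the body.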

+2

Sergey Kabashnyuk Apr 16 '10 at 14:04

See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene to split PDF files into text page by page, index those text pages, and create a resulting HTML index file that links to the pages in the source PDFs using the corresponding open parameter.
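The page-by-page idea can be sketched roughly as follows. This is not the code from the linked project; it is a minimal illustration that assumes PDFBox 2.x and a recent Lucene on the classpath, and the class name PdfPageIndexerSketch and the field names source, page, and text are made up for this example.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class; names are assumptions for illustration.
public class PdfPageIndexerSketch {

    /** Adds one Lucene document per PDF page; returns the number of pages indexed. */
    public static int indexPages(IndexWriter writer, PDDocument pdf, String source)
            throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        int pageCount = pdf.getNumberOfPages();
        for (int page = 1; page <= pageCount; page++) {
            // PDFBox page numbers are 1-based; restrict extraction to one page.
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            String text = stripper.getText(pdf);

            Document doc = new Document();
            doc.add(new StringField("source", source, Field.Store.YES));
            // Stored only, so a search hit can report which page matched.
            doc.add(new StoredField("page", page));
            doc.add(new TextField("text", text, Field.Store.NO));
            writer.addDocument(doc);
        }
        return pageCount;
    }
}
```

In PDFBox 2.x the PDDocument would typically come from PDDocument.load(file) inside a try-with-resources block. The stored source and page values of a hit are enough to build a link of the form file.pdf#page=3 in a generated HTML index.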

+1

Wolfgang Fahl May 12 '13 at 7:44

Michael Shimmins · Accepted Answer · 2010-04-06T06:11:35+0000

Lucene indexes text, not files; you will need a separate process to extract the text from each file and then run Lucene over that text.
