How do I index PDF, PPT, and XLS files in Lucene (Java, Python, or PHP; any of them is fine)?

I also want to know how to add metadata during indexing so that I can boost certain fields.

+3
4 answers

Lucene indexes text, not files - you will need another process to extract the text from each file and then hand that text to Lucene.
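The indexing half of that pipeline can be sketched roughly as follows. This is only an illustration, assuming Lucene 8 or later on the classpath and assuming the text has already been extracted by some other tool; the class name, the field names "path" and "content", and the sample strings are made up for the example.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical sketch: Lucene only ever sees plain strings, never the PDF/PPT/XLS bytes.
public class ExtractedTextIndexSketch {

    /** Adds one document: the file path as an exact-match key, the extracted text as analyzed content. */
    static void addExtractedText(IndexWriter writer, String path, String text) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));   // stored, not tokenized
        doc.add(new TextField("content", text, Field.Store.NO));   // tokenized, not stored
        writer.addDocument(doc);
    }

    /** Counts documents whose content contains the given lowercase, single-word term. */
    static int countHits(Directory dir, String term) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.count(new TermQuery(new Term("content", term)));
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory index for the demo; a real index would typically use FSDirectory.open(path).
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // The strings stand in for whatever the extraction step produced.
            addExtractedText(writer, "report.pdf", "Quarterly revenue grew");
            addExtractedText(writer, "slides.ppt", "Roadmap for the next quarter");
        }
        System.out.println(countHits(dir, "revenue"));
    }
}
```

The extraction step is deliberately left out here; any tool that turns a file into a String can feed addExtractedText.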

+1

There are several text-extraction frameworks suitable for pulling the text out of files (PDF, PPT, etc.) so that Lucene can index it.

+4

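You could use Apache Tika. Tika is a toolkit for detecting and extracting metadata and text from many document types. Supported formats include: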

  • XML
  • Microsoft Office
  • OpenDocument
  • Rich Text
  • Java class files and archives
  • mbox

The code would look something like this: Reader reader = new Tika().parse(stream);
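To cover the metadata part of the question, here is a rough sketch that builds on the Tika call above. It assumes Tika (core plus parsers) and Lucene 8 or later are on the classpath; the class name, method names, and the field names "path", "content", and "title" are invented for illustration. It also assumes that boosting is done at query time with BoostQuery, since recent Lucene versions no longer support index-time field boosts.

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

// Hypothetical sketch: Tika supplies text and metadata, Lucene stores them as separate fields.
public class TikaMetadataSketch {

    /** Extracts text and metadata with Tika and maps them onto Lucene fields. */
    static Document toLuceneDocument(String path, InputStream stream)
            throws IOException, TikaException {
        Metadata metadata = new Metadata();
        // parseToString fills the Metadata object while it extracts the body text.
        // Note: Tika truncates very long documents unless its max string length is raised.
        String text = new Tika().parseToString(stream, metadata);

        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new TextField("content", text, Field.Store.NO));

        // Not every file carries a title, so the field is added only when present.
        String title = metadata.get(TikaCoreProperties.TITLE);
        if (title != null) {
            doc.add(new TextField("title", title, Field.Store.YES));
        }
        return doc;
    }

    /** Builds a query in which a match in the title counts titleBoost times as much as one in the body. */
    static Query titleBoostedQuery(String term, float titleBoost) {
        Query inTitle = new BoostQuery(new TermQuery(new Term("title", term)), titleBoost);
        Query inContent = new TermQuery(new Term("content", term));
        return new BooleanQuery.Builder()
                .add(inTitle, BooleanClause.Occur.SHOULD)
                .add(inContent, BooleanClause.Occur.SHOULD)
                .build();
    }
}
```

The resulting Document can be passed to IndexWriter.addDocument, and the query to IndexSearcher.search. The term given to titleBoostedQuery should already be lowercase if the fields were indexed with StandardAnalyzer.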

+2

See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene to split PDF files into text page by page, index these text pages, and generate an HTML index file that links to the pages in the source PDFs using the corresponding parameter.

+1

Source: https://habr.com/ru/post/1739946/

