Programmatically search multiple PDF files for a keyword and note the page number

I work in a museum with hundreds of scientific PDF articles sitting in a directory. I have OCR'd all of them so that they can be searched for keywords in programs like Adobe Reader. I need to write a program that will let me search this directory for a specific name and generate a list of the documents that contain the keyword, along with the page number of each match.

I am looking for a (hopefully free) PDF library with which I can accomplish this task. I wrote a small program using the PDFOne library, but the search took about 10 minutes to find a single term in the directory. I would like to reduce that time significantly, because Adobe Reader and PDF-XChange Viewer can do the same search in less than a minute. I have no language preference.

Can someone point me to the right resources so that I can complete this task? Thanks.

1 answer

I suggest you evaluate Apache Solr, which can index PDF files very efficiently.

http://lucene.apache.org/solr/
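
To illustrate why an index helps here, below is a minimal Python sketch of the idea: extract the text once, build a word-to-(document, page) lookup, and then answer each keyword query from that lookup instead of rescanning every PDF. This is not Solr's implementation or API. The names `build_index` and `search` are hypothetical, and the sketch assumes the per-page text has already been extracted with a PDF library of your choice. With Solr itself, you would probably need to index each page as its own entry to get page numbers back, since a whole-document index only tells you which file matched.

```python
import re
from collections import defaultdict

# Hypothetical helper names for illustration; not part of Solr or any PDF library.
TOKEN = re.compile(r"\w+")


def build_index(pages_by_doc):
    """Map each lowercased word to the set of (document, page number) pairs containing it.

    pages_by_doc: dict of document name -> list of page texts (page 1 first),
    assumed to have been extracted beforehand from the OCR'd PDFs.
    """
    index = defaultdict(set)
    for doc, pages in pages_by_doc.items():
        for page_no, text in enumerate(pages, start=1):
            for word in TOKEN.findall(text.lower()):
                index[word].add((doc, page_no))
    return index


def search(index, keyword):
    """Return sorted (document, page number) hits for a single-word keyword."""
    return sorted(index.get(keyword.lower(), ()))
```

Building the index is the slow step, but it only has to run once (or when new articles are added); each search afterwards is a single dictionary lookup. Multi-word names would need an extra step, for example intersecting the hit sets of the individual words.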


Source: https://habr.com/ru/post/953618/
