Fastest PDF-to-text library for a .NET project

I am trying to build an application that will essentially be a catalog of my PDF collection. We are talking about 15-20 GB containing tens of thousands of PDF files. I also plan to include a full-text search engine. I will use Lucene.NET for search (in fact, NHibernate.Search) and a library for converting PDF to text. What would be the best choice? I have looked at these:

  • PDFBox
  • pdftotext (from Xpdf) via a C# wrapper
  • iTextSharp

Edit: Another good option is to use IFilters. How well do they (Foxit / Adobe) perform in terms of speed and quality compared to these libraries?

Commercial libraries are probably out of the question, since this is a private project and I have no budget for commercial solutions, although PDFTextStream looks very good.

From what I have read, pdftotext is much faster than PDFBox. How does iTextSharp compare to pdftotext? Or can someone recommend other good solutions?

3 answers

If this is for a private project, will this be an ongoing conversion process? For instance, after you have converted the 15-20 GB, are you going to keep converting on an ongoing basis?

, , , , . , , , , . /-weekend, !

+3

Foxit PDF IFilter

http://www.foxitsoftware.com/pdf/ifilter/


Foxit PDF Reader/Text Extraction, , Foxit.

+1

, , 20Gb ?

, sqlite , pdf , .

:

Table: PDFFiles
PDFFileID
PDFFilePath
PDFTitle
PDFAuthor
PDFKeywords
PDFFullText....
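
As a rough illustration of the table above, here is a minimal sketch using Python's built-in sqlite3 module (chosen only to keep the example self-contained; the project itself is .NET). The column names come from the listing; the column types, the constraints, and the helper names open_catalog, add_pdf and search_full_text are assumptions made for this example. The LIKE query is only a stand-in for a real full-text index such as the Lucene.NET one mentioned in the question.

```python
import sqlite3

# Column names follow the PDFFiles table listed above.
# Types and constraints are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS PDFFiles (
    PDFFileID   INTEGER PRIMARY KEY,
    PDFFilePath TEXT NOT NULL UNIQUE,
    PDFTitle    TEXT,
    PDFAuthor   TEXT,
    PDFKeywords TEXT,
    PDFFullText TEXT
)
"""


def open_catalog(db_path=":memory:"):
    """Open (or create) the catalog database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def add_pdf(conn, path, title, author, keywords, full_text):
    """Store one already-extracted PDF and return its PDFFileID."""
    cur = conn.execute(
        "INSERT INTO PDFFiles "
        "(PDFFilePath, PDFTitle, PDFAuthor, PDFKeywords, PDFFullText) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, title, author, keywords, full_text),
    )
    conn.commit()
    return cur.lastrowid


def search_full_text(conn, term):
    """Naive substring search over the extracted text.

    Note: '%' and '_' inside term act as LIKE wildcards in this sketch.
    """
    rows = conn.execute(
        "SELECT PDFFilePath FROM PDFFiles "
        "WHERE PDFFullText LIKE ? ORDER BY PDFFileID",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]
```

The text passed to add_pdf would come from whichever extractor is chosen (pdftotext, an IFilter, and so on); storing it once means the PDFs only need to be converted a single time.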

, , pdf, pdf , PDF .

0

Source: https://habr.com/ru/post/1755951/

