I am trying to build an application that is essentially a catalog of my PDF collection: 15-20 GB, tens of thousands of PDF files. I also plan to include full-text search. I will use Lucene.NET for search (in fact, NHibernate.Search) and a library for converting PDF to text. Which library would be the best choice? I have looked at these:
- PDFBox
- pdftotext (from xpdf) via a C# wrapper
- iTextSharp
Edit: Another option is to use IFilters (Foxit / Adobe). How do they compare to these libraries in speed and extraction quality?
Commercial libraries are probably out of the question: this is a private project and I have no budget for commercial solutions, although PDFTextStream looks very good.
From what I have read, pdftotext is a lot faster than PDFBox. How does iTextSharp compare to pdftotext? Or can someone recommend other good solutions?
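To make the speed comparison concrete, here is a minimal sketch of how extraction could be timed over a sample of the collection. It is written in Python only to keep it short, and it assumes the `pdftotext` command-line tool is installed and on the PATH. The folder name `sample_pdfs` and the helper names `pdftotext_command` and `time_extraction` are hypothetical, not part of any library. Other extractors could be timed the same way by passing a different command builder.

```python
import subprocess
import time
from pathlib import Path


def pdftotext_command(pdf_path, encoding="UTF-8"):
    """Build the argument list for the pdftotext command-line tool.

    The trailing "-" asks pdftotext to write the text to stdout
    instead of creating a .txt file next to the PDF.
    """
    return ["pdftotext", "-enc", encoding, str(pdf_path), "-"]


def time_extraction(pdf_paths, build_command=pdftotext_command):
    """Run the extractor once per file.

    Returns (elapsed_seconds, total_characters). Files for which the
    extractor exits with a non-zero status contribute no characters.
    """
    total_chars = 0
    start = time.perf_counter()
    for pdf_path in pdf_paths:
        result = subprocess.run(
            build_command(pdf_path), capture_output=True, check=False
        )
        if result.returncode == 0:
            total_chars += len(result.stdout.decode("utf-8", errors="replace"))
    return time.perf_counter() - start, total_chars


if __name__ == "__main__":
    # Hypothetical folder holding a representative sample of the collection.
    sample = sorted(Path("sample_pdfs").glob("*.pdf"))
    seconds, chars = time_extraction(sample)
    print(f"{len(sample)} files, {chars} characters, {seconds:.2f} s")
```

Character counts are only a rough proxy for quality; the extracted text of a few files would still need to be checked by eye.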