Fastest PDF-to-text library for a .NET project

Question

I am trying to build an application that will basically be a catalog of my PDF collection: 15-20 GB, containing tens of thousands of PDF files. I also plan to include a full-text search engine. I will use Lucene.NET for search (in fact, NHibernate.Search) and a library for converting PDF to text. What would be the best choice? I have looked at these:

PDFBox
pdftotext (from xpdf) via a C# wrapper
iTextSharp

Edit: Another good option is to use IFilters. How well (in speed and quality) would they (Foxit / Adobe) perform compared to these libraries?

Commercial libraries are probably out of the question, as this is a private project and I don't have a budget for commercial solutions, although PDFTextStream looks very good.

From what I have read, pdftotext is a lot faster than PDFBox. How well does iTextSharp compare to pdftotext? Or can someone recommend other good solutions?

c# pdf itextsharp pdfbox xpdf

n0e Jul 22 '10 at 10:29

3 answers

Ray Hayes · Answer 1 · 2010-07-22T10:40:48+0000

If this is for a private project, will this be an ongoing conversion process? For instance, after you have converted the 15-20 GB, are you going to keep converting?

, , , , . , , , , . /-weekend, !

Lou Franco · Answer 2 · 2010-07-22T12:59:30+0000

Foxit PDF IFilter

http://www.foxitsoftware.com/pdf/ifilter/

, , , . , , , , , .

Foxit PDF Reader/Text Extraction, , Foxit.

Akash Kava · Answer 3 · 2010-07-22T10:55:35+0000

, , 20Gb ?

, sqlite , pdf , .

:

Table: PDFFiles
PDFFileID
PDFFilePath
PDFTitle
PDFAuthor
PDFKeywords
PDFFullText....
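
The column list above could be turned into a concrete SQLite table roughly as follows. This is only a sketch: the column names come from the list, but the column types, the UNIQUE constraint on the path, the LIKE-based lookup, and the helper names (`open_catalog`, `add_pdf`, `find_by_text`) are assumptions made for illustration. Python is used here only for brevity; the SQL itself should carry over to a .NET SQLite binding.

```python
import sqlite3

# Column names follow the PDFFiles list above; the types and constraints
# are assumptions, not something the answer specifies.
SCHEMA = """
CREATE TABLE IF NOT EXISTS PDFFiles (
    PDFFileID   INTEGER PRIMARY KEY,
    PDFFilePath TEXT NOT NULL UNIQUE,
    PDFTitle    TEXT,
    PDFAuthor   TEXT,
    PDFKeywords TEXT,
    PDFFullText TEXT
)
"""


def open_catalog(db_path=":memory:"):
    """Open (or create) the catalog database and ensure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def add_pdf(conn, path, title, author, keywords, full_text):
    """Insert one PDF record and return its generated PDFFileID."""
    cur = conn.execute(
        "INSERT INTO PDFFiles "
        "(PDFFilePath, PDFTitle, PDFAuthor, PDFKeywords, PDFFullText) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, title, author, keywords, full_text),
    )
    conn.commit()
    return cur.lastrowid


def find_by_text(conn, needle):
    """Return paths of PDFs whose extracted text contains `needle`.

    Uses a plain substring match (LIKE), which scans every row; it is a
    placeholder for a real full-text index, not a replacement for one.
    """
    rows = conn.execute(
        "SELECT PDFFilePath FROM PDFFiles "
        "WHERE PDFFullText LIKE ? ORDER BY PDFFilePath",
        ("%" + needle + "%",),
    )
    return [row[0] for row in rows]
```

With tens of thousands of files, a substring scan over `PDFFullText` is likely to be slow, so in practice the table would mostly hold metadata and extracted text while a dedicated index (such as the Lucene.NET setup mentioned in the question) handles search.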

, , pdf, pdf , PDF .
