What is a good method for extracting text from a PDF using C# or classic ASP (VBScript)?

Question


Is there a good library for extracting text from PDF? I am willing to pay for it if necessary.

Something that works with C# or classic ASP (VBScript) would be ideal, and I would also need to split the PDF into separate pages.

This question had some interesting material, especially pdftotext, but I would like to avoid calling an external command-line application if possible.


pdf text-extraction pdf-scraping

Mark biek Sep 05 '08 at 20:55


5 answers

Here is a good list: Open Source Libs for PDF / C#

Most of them are designed for creating PDF files, but they should be able to read them as well.

There is this one: iText

I only played with iText before. Nothing serious.
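As a rough illustration of what per-page extraction with iText could look like from C#, here is a minimal sketch. It assumes the iTextSharp 5.x .NET port is referenced (that version is my assumption, not something stated in this answer), and `input.pdf` is a placeholder path. Treat it as a starting point rather than a tested solution.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class PdfTextDemo
{
    // Returns the text of each page as a separate list entry,
    // which also covers the "separate the pages" requirement.
    static List<string> ExtractPages(string path)
    {
        var pages = new List<string>();
        var reader = new PdfReader(path);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();
        }
        return pages;
    }

    static void Main(string[] args)
    {
        // "input.pdf" is a hypothetical default; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";
        List<string> pages = ExtractPages(path);
        for (int i = 0; i < pages.Count; i++)
        {
            Console.WriteLine("--- Page {0} ---", i + 1);
            Console.WriteLine(pages[i]);
        }
    }
}
```

Text quality depends on how the PDF was produced; scanned documents without a text layer will come back empty.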

Doanair Sep 05 '08 at 9:03

We used Aspose with good results.

Chuck Sep 05 '08 at 21:23

The Docotic.Pdf library can be used to extract formatted or plain text from PDF documents.

The library can read PDF documents of any version (up to the latest published standard). It also supports extracting individual pages.

Links to sample code:

Disclaimer: I work for the vendor of the library.

Bobrovsky Jan 21 '12 at 22:22

An addition to the accepted answer: there are also commercial alternatives to the Adobe IFilter for text indexing. They provide a similar API and offer additional premium functionality:

Foxit PDF IFilter: provides much faster text indexing than the Adobe plugin.
PDFLib PDF iFilter: includes support for corrupted PDF documents, plus an additional API for running your own queries.

If you are looking for a single tool that can be used from both managed .NET applications and legacy languages such as classic ASP or VB6, the commercial ByteScout PDF Extractor SDK covers both, since it exposes a .NET API as well as an ActiveX/COM API.

Disclaimer: I work for ByteScout

Eugene m Feb 24 '15 at 11:43

Ferruccio · Accepted Answer · 2008-09-05T21:12:38+0000

You can use the IFilter interface built into Windows to extract text and properties (author, title, etc.) from any supported file type. It is a COM interface, so from C# you would call it through .NET COM interop.

You also need to download the free PDF IFilter driver from Adobe.
