PDF search and split library

Question

I am looking for a server side PDF library (or a command line tool) that can:

split a multi-page PDF file into individual PDF files, based on a search of the PDF file's content

Examples:

Search for a "Page ???" pattern in the text and split the large PDF into 001.pdf, 002.pdf, ... ???.pdf

A server program scans the PDF file, looks for the search pattern, collects the pages that match the pattern, and saves the resulting file to disk.

Integration with PHP / Ruby would be nice. A command line tool is also acceptable. This will be batch processing on the server side (Linux or Win32). A GUI / login is not available. I18n support would be nice but is not required. Thanks ~

+3

search pdf

ohho Apr 21 '10 at 7:23

4 answers

My company, Atalasoft, has just released some PDF tools that run on .NET. There is a text extraction class that you can use to search for text and determine how to split your document, and a very high-level document class that makes the splitting trivial. Suppose you have a Stream to your original PDF file and a List<int>, in increasing order, that describes the start page of each split. Then the code for creating your split files looks like this:

public void SplitPdf(Stream stm, List<int> pageStarts, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);
    int lastPage = mainDoc.Pages.Count - 1;

    for (int i = 0; i < pageStarts.Count; i++) {
        // Each split runs from its start page to the page before the next start,
        // or to the last page of the document for the final split.
        int startPage = pageStarts[i];
        int endPage = (i < pageStarts.Count - 1) ?
            pageStarts[i + 1] - 1 :
            lastPage;
        if (startPage > endPage) throw new ArgumentException("list is not ordered properly", "pageStarts");
        PdfDocument splitDoc = new PdfDocument();
        for (int j = startPage; j <= endPage; j++)
            splitDoc.Pages.Add(mainDoc.Pages[j]);

        // Output files are named 001.pdf, 002.pdf, ...
        string outputPath = Path.Combine(outputDirectory,
                                         string.Format("{0:D3}.pdf", i + 1));
        splitDoc.Save(outputPath);
    }
}

If you generalize this to a page range structure:

public struct PageRange {
    public int StartPage;
    public int EndPage;
}

With StartPage and EndPage both treated as inclusive, the code becomes:

public void SplitPdf(Stream stm, List<PageRange> ranges, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);

    int outputDocCount = 1;
    foreach (PageRange range in ranges) {
        int startPage = Math.Min(range.StartPage, range.EndPage); // assume not in order
        int endPage = Math.Max(range.StartPage, range.EndPage);
        PdfDocument splitDoc = new PdfDocument();
        for (int i=startPage; i <= endPage; i++)
            splitDoc.Pages.Add(mainDoc.Pages[i]);
        string outputPath = Path.Combine(outputDirectory, 
                                         string.Format("{0:D3}.pdf", outputDocCount));
        splitDoc.Save(outputPath);
        outputDocCount++;
    }
}

+4

plinth Apr 23 '10 at 12:58

PDFBox is a Java library:

http://pdfbox.apache.org/

PDFBox can extract text and split / merge PDFs.

+2

Steve Claridge Apr 23 '10 at 14:36

Perhaps pdfsam for the splitting, pdftotext (from foolabs.com) for extracting the text, and Ruby (or grep) for the searching.
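
The search step of such a pipeline could be sketched roughly as follows. pdftotext separates pages in its output with a form feed character, so the extracted text can be cut into pages and each page tested against a regular expression. This is only a sketch in Python: the function name `matching_pages` and the sample text are hypothetical, and the text is assumed to have been produced by pdftotext beforehand.

```python
import re

def matching_pages(text, pattern):
    """Return the 1-based numbers of the pages whose text matches pattern.

    Assumes `text` is pdftotext output, in which pages are separated
    by form feed characters.
    """
    regex = re.compile(pattern)
    pages = text.split("\f")
    # pdftotext ends each page with a form feed, so drop the empty trailing chunk.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [n for n, page in enumerate(pages, start=1) if regex.search(page)]

# Hypothetical three-page sample; only pages 1 and 3 carry a marker.
sample = "Page 001\nintro\fcontinued text\fPage 002\nmore\f"
print(matching_pages(sample, r"Page \d{3}"))  # prints [1, 3]
```

The page numbers found this way can then be handed to whichever tool does the actual splitting.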

+1

topskip Apr 21 '10 at 7:38

ohho · Accepted Answer · 2010-04-28T04:39:22+0000

pdfminer, plus matching across multiple lines in Python
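
A minimal sketch of what the multi-line matching might look like, assuming the text of each page has already been extracted (for example with pdfminer) into a list of strings. The marker pattern, the function name `split_plan`, and the sample pages are hypothetical. Using `\s+` between "Page" and the number lets the marker match even when the extractor puts the two on separate lines.

```python
import re

# Hypothetical marker: "Page" followed by a number, possibly on a later line.
MARKER = re.compile(r"Page\s+(\d+)")

def split_plan(page_texts):
    """Turn per-page text into a list of (output_name, first_page, last_page).

    Page indexes are 0-based and inclusive. Pages before the first
    marker are not assigned to any output file.
    """
    starts = []
    for index, text in enumerate(page_texts):
        match = MARKER.search(text)
        if match:
            starts.append((int(match.group(1)), index))
    plan = []
    for k, (number, first) in enumerate(starts):
        # A section runs until the page before the next marker, or to the end.
        last = starts[k + 1][1] - 1 if k + 1 < len(starts) else len(page_texts) - 1
        plan.append(("%03d.pdf" % number, first, last))
    return plan

pages = ["Page\n1\nintro", "more intro", "Page 2\nbody", "Page\n\n3\nend"]
print(split_plan(pages))
# prints [('001.pdf', 0, 1), ('002.pdf', 2, 2), ('003.pdf', 3, 3)]
```

The resulting ranges only describe where to cut; writing the output files is left to a PDF library or command line tool.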