PDF to text tool or Java library?

Question

I need to convert a PDF to plain text (it is a "vote" from our county registrar). The files are large (2000 pages or so) and consist mostly of tables. Once I have the text, I will use a program I am writing to parse it and load the data into a database. I tried the Save As Text feature in Adobe Reader, but it is not as accurate as I would like, especially at delimiting the table data for CSV. So, any recommendations for Java tools or libraries that could do the trick?

+3

java pdf

Gary kephart Feb 24 '09 at 21:07

7 answers


Michael myers · Answer 1 · 2009-02-24T21:11:14+0000

Well, there is iText. I have only limited experience with it, but it seems it can do what you want.

Apache PDFBox can certainly do this. Its site lists "PDF to text extraction" as a main feature. There is an ExtractText command-line tool specifically for this (source code), built on the PDFTextStripper class. There is also the PDFBox text extraction guide.

Arjan · Answer 2 · 2012-08-28T17:26:09+0000

Given the title of the question: Apache Tika did a great job of extracting plain text from PDF for me. I have not used it to get text out of tables.

For PDF it uses PDFBox. Besides PDF, it supports Microsoft Word (doc and docx), Excel and PowerPoint, OpenOffice.org/LibreOffice ODT, HTML, XML and more. AutoDetectParser picks the right parser for the input.

If you want a Reader (for example, to feed the text to Mahout), use ParsingReader. For example:

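import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParsingReader;

// Note: "logger" is assumed to be an SLF4J-style Logger field of the enclosing class.
public Reader getPlainTextReader(final InputStream is) {
    try {
        // Detect the document type and pick a matching parser automatically
        Detector detector = new DefaultDetector();
        Parser parser = new AutoDetectParser(detector);
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        Metadata metadata = new Metadata();

        // Exposes the extracted plain text as a character stream
        Reader reader = new ParsingReader(parser, is, metadata, context);

        // Log whatever metadata the parser has reported so far
        for (String name : metadata.names()) {
            for (String value : metadata.getValues(name)) {
                logger.debug("Document {}: {}", name, value);
            }
        }

        return reader;

    } catch (IOException e) {
        ...
    }
}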
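
Since the files in the question run to about 2000 pages, it may help to consume the Reader line by line instead of buffering the whole text. The following is only a minimal sketch using the JDK alone; the class name ReaderLines and the forEachLine helper are made up for illustration and are not part of Tika. A StringReader stands in for the Reader returned by getPlainTextReader above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not part of Tika): streams any Reader line by line.
final class ReaderLines {

    // Passes each line to the handler and returns the number of lines seen.
    static int forEachLine(Reader source, Consumer<String> handler) throws IOException {
        int count = 0;
        // try-with-resources closes the underlying Reader when done
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; in practice pass the Reader produced by the parser
        List<String> lines = new ArrayList<>();
        int n = forEachLine(new StringReader("Precinct 1\nPrecinct 2\n"), lines::add);
        System.out.println(n + " lines: " + lines);
    }
}
```

In real use the handler would parse each line and write it to the database, so memory use stays flat regardless of the page count.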

Jarod Elliott · Answer 3 · 2009-02-24T21:14:40+0000

xpdf is worth a look, for example for PDF-to-EDI conversion.

cemerick · Answer 4 · 2009-12-07T13:47:56+0000

PDFTextStream is a Java and .NET library for PDF text extraction.

SacramentoJoe · Answer 5 · 2009-02-24T23:25:37+0000

Look at iText, or at xmlpdf.

Steve claridge · Answer 6 · 2009-02-24T23:58:12+0000

I have tried both iText and PDFBox for this. A basic extraction takes fewer than 30 lines of Java.

Someone else mentioned xpdf, and it might be useful for you. It is a C library that ships with some command-line tools. It has a number of text extractors, and you can easily format the output. Again, this really depends on your page layout.
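
To make the xpdf route concrete from Java, here is a rough sketch that shells out to xpdf's pdftotext with its -layout option (which keeps the physical layout of the page) and then turns one line of that output into a CSV row. The class and method names are invented for this example. It assumes pdftotext is on the PATH, and it assumes that table columns end up separated by two or more spaces, which will not hold for every page layout.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: class and method names are made up for this example.
final class LayoutTextToCsv {

    // Builds the command line for pdftotext; -layout keeps the physical layout.
    static List<String> buildCommand(Path pdf, Path txt) {
        return Arrays.asList("pdftotext", "-layout", pdf.toString(), txt.toString());
    }

    // Runs pdftotext and returns its exit code (assumes pdftotext is on the PATH).
    static int run(Path pdf, Path txt) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(pdf, txt)).inheritIO().start();
        return p.waitFor();
    }

    // Assumption: columns are separated by two or more spaces in the -layout output.
    static String toCsvRow(String line) {
        String[] cells = line.trim().split(" {2,}");
        List<String> quoted = new ArrayList<>();
        for (String cell : cells) {
            // Quote every cell and double any embedded quotes
            quoted.add('"' + cell.replace("\"", "\"\"") + '"');
        }
        return String.join(",", quoted);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Usage: java LayoutTextToCsv input.pdf output.txt
            System.exit(run(Paths.get(args[0]), Paths.get(args[1])));
        }
        // Without arguments, just demonstrate the row conversion on a sample line
        System.out.println(toCsvRow("Precinct 12    JONES, A    1,204"));
    }
}
```

Quoting every cell keeps values such as "JONES, A" or "1,204" intact even though they contain commas. Cells that themselves contain runs of spaces would be split incorrectly, so the separator rule needs checking against the actual document.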
