PDF to text tool or Java library?

I need to convert a PDF to plain text (it is a "vote" document from our county registrar). The files are large (2000 pages or so) and consist mostly of tables. Once I have the text, I will parse it with a program I am writing and load the data into a database. I tried the Save As Text feature in Adobe Reader, but it is not as accurate as I would like, especially when it comes to delimiting the table data for CSV. Are there any recommendations for tools or Java libraries that could do the trick?

+3
7 answers

Well, there is iText. I have only limited experience with it, but it seems to be able to do what you want.

Apache PDFBox can certainly do this. Its site lists PDF-to-text extraction as one of its main features. There is an ExtractText command-line tool specifically for this (source code), based on the PDFTextStripper class. There is also a PDFBox text extraction guide.

+6

Going by the title of the question: Apache Tika has done a great job of extracting plain text from PDFs for me. I have not used it to get text out of tables.

For PDF it uses PDFBox. Besides PDF, it handles Microsoft Word (doc and docx), Excel and PowerPoint, OpenOffice.org/LibreOffice ODT, HTML, XML, and other formats. AutoDetectParser picks the appropriate parser automatically.

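For further processing (for example, with Mahout), ParsingReader is handy because it exposes the extracted text as a Reader. Something like this: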

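import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParsingReader;

// Method of an enclosing class; "logger" is assumed to be an SLF4J Logger field there.
public Reader getPlainTextReader(final InputStream is) {
    try {
        // Let Tika detect the document type and choose a matching parser.
        Detector detector = new DefaultDetector();
        Parser parser = new AutoDetectParser(detector);
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        Metadata metadata = new Metadata();

        // The reader streams the extracted plain text as it is parsed.
        Reader reader = new ParsingReader(parser, is, metadata, context);

        // Log whatever metadata the parser reported for the document.
        for (String name : metadata.names()) {
            for (String value : metadata.getValues(name)) {
                logger.debug("Document {}: {}", name, value);
            }
        }

        return reader;
    } catch (IOException e) {
        ... // error handling elided in the original
    }
}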
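
A caller might then consume the returned Reader line by line, so that a 2000-page document never has to sit in memory all at once. This is only a minimal sketch using the JDK alone: the class name ReaderLines and the forEachLine helper are hypothetical, and a StringReader with made-up rows stands in for the Tika-backed reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ReaderLines {

    // Streams text from any Reader and passes each non-blank line to the handler.
    // Returns the number of lines passed on; the reader is closed when done.
    public static int forEachLine(Reader source, Consumer<String> handler) {
        int count = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                handler.accept(line);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for getPlainTextReader(...): any Reader works here.
        Reader reader = new StringReader("Precinct 1  120  98\n\nPrecinct 2  87  110\n");
        List<String> rows = new ArrayList<>();
        int n = forEachLine(reader, rows::add);
        System.out.println(n + " lines read");
    }
}
```

In a real pipeline the handler would be the place to parse a row and write it to the database, instead of collecting lines in a list.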
+5

xpdf.

pdf - PDF EDI. , .

+2

PDFTextStream is a Java and .NET library for PDF text extraction.

+1

() .

0

iText, . xmlpdf , iText .

0

PDF-, .

iText, PDBox. - , < 30 , , Java.

PDFBox, , iText.

Someone else mentioned xpdf, and it might be useful for you. It is a C++ library that comes with some command-line tools. It includes several text extractors, and the output is easy to format. Again, this really depends on your page layout.
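
One way to try the xpdf route from Java: the pdftotext tool that ships with xpdf has a -layout option that keeps the physical column layout, and passing - as the output file sends the text to standard output. The sketch below assumes pdftotext is on the PATH and that table columns end up separated by at least two spaces, which will not hold for every page layout. The class name LayoutToCsv and the method toCsvLine are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LayoutToCsv {

    // Splits one layout-preserved line into cells on runs of two or more spaces
    // and joins them as a CSV record, quoting cells that contain a comma or a quote.
    // Assumption: single spaces occur only inside a cell, never between columns.
    public static String toCsvLine(String line) {
        String[] cells = line.trim().split(" {2,}");
        List<String> out = new ArrayList<>();
        for (String cell : cells) {
            if (cell.contains(",") || cell.contains("\"")) {
                cell = "\"" + cell.replace("\"", "\"\"") + "\"";
            }
            out.add(cell);
        }
        return String.join(",", out);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: java LayoutToCsv input.pdf");
            return;
        }
        // Run: pdftotext -layout -enc UTF-8 <input.pdf> -   (text goes to stdout)
        Process p = new ProcessBuilder("pdftotext", "-layout", "-enc", "UTF-8", args[0], "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // trim() also drops the form feed that pdftotext emits between pages
                if (!line.trim().isEmpty()) {
                    System.out.println(toCsvLine(line));
                }
            }
        }
        int exit = p.waitFor();
        if (exit != 0) {
            System.err.println("pdftotext exited with " + exit);
        }
    }
}
```

Running it as java LayoutToCsv votes.pdf > votes.csv (file names are placeholders) would give a first-pass CSV. Page headers and wrapped cells would still need handling in the parser that follows.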

0

Source: https://habr.com/ru/post/1703775/

