Given the title of the question: Apache Tika did a great job of extracting plain text from PDF. I did not use it to get text from tables.
For PDF it uses PDFBox. Besides PDF, it handles Microsoft Word (doc and docx), Excel and PowerPoint, OpenOffice.org/LibreOffice ODT, HTML, XML and other formats. AutoDetectParser detects the format automatically.
For downstream processing (for example, with Mahout), ParsingReader gives you the extracted text as a Reader. Something like this:
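    import java.io.IOException;
    import java.io.InputStream;
    import java.io.Reader;
    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.ParsingReader;

    // Assumes "logger" is an SLF4J Logger field of the enclosing class.
    public Reader getPlainTextReader(final InputStream is) {
        try {
            // Detect the document type and dispatch to the matching parser.
            Detector detector = new DefaultDetector();
            Parser parser = new AutoDetectParser(detector);
            ParseContext context = new ParseContext();
            // Registering the parser lets embedded documents be parsed too.
            context.set(Parser.class, parser);
            Metadata metadata = new Metadata();
            // Parsing runs in the background; the reader streams the text.
            Reader reader = new ParsingReader(parser, is, metadata, context);
            for (String name : metadata.names()) {
                for (String value : metadata.getValues(name)) {
                    logger.debug("Document {}: {}", name, value);
                }
            }
            return reader;
        } catch (IOException e) {
            ...
        }
    }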
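To consume the returned Reader, something along these lines should work. This is only a minimal sketch using the plain JDK: the class name `ReaderText` and the method `readAll` are made up for illustration, and the `StringReader` in `main` merely stands in for the Reader you would get from `getPlainTextReader`. As far as I know, ParsingReader does its parsing in a background task, so the reader should be closed when you are done with it; try-with-resources handles that here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Tika: drains any Reader into a String.
final class ReaderText {

    // Reads all characters from the reader, then closes it.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[8192];
        try (Reader in = reader) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; in real use pass getPlainTextReader(inputStream).
        String text = readAll(new StringReader("first line\nsecond line"));
        System.out.println(text);
    }
}
```

For large documents you would probably process the buffer chunk by chunk instead of collecting everything into one String, since the whole point of a Reader is that the text does not have to fit in memory at once.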