I am parsing PDF files with Apache Tika (tika-app-1.3) using this code:
InputStream input = new FileInputStream("Introduction.pdf");
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(100 * 1024 * 1024);
Metadata metadata = new Metadata();
parser.parse(input, handler, metadata);
System.out.println(handler.toString());
handler.toString() returns the text of the PDF, but the bullet markers in that text appear as a ? symbol. I want these bullets to appear as they are in the document. Is there a way to get the original content as it is using Apache Tika? Or does an encoding need to be specified somewhere during parsing?
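One thing worth ruling out (an assumption, since the post does not say where the output is viewed) is that the text is extracted correctly and the ? only appears when System.out encodes it in a charset that has no bullet character. The sketch below uses only the JDK to show that effect. The class name, the roundTrip helper, and the sample string standing in for handler.toString() are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class BulletEncodingDemo {
    // Prints the text through a PrintStream that uses the given charset,
    // then decodes the written bytes back to a String with the same charset.
    public static String roundTrip(String text, String charset)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(bytes, true, charset);
        out.print(text);
        out.flush();
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical stand-in for handler.toString(): U+2022 is the bullet character.
        String extracted = "\u2022 First item";

        // US-ASCII cannot represent the bullet, so the stream substitutes '?'.
        System.out.println(roundTrip(extracted, "US-ASCII").equals("? First item"));

        // UTF-8 can represent it, so the text survives unchanged.
        System.out.println(roundTrip(extracted, "UTF-8").equals(extracted));
    }
}
```

If this turns out to be the cause, printing through new PrintStream(System.out, true, "UTF-8"), or writing the text to a file with an explicit UTF-8 writer, may keep the bullets intact, provided the terminal or viewer also displays UTF-8.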