PDF bullets come out as question marks when parsing with Apache Tika in Java

I am parsing PDF files with Apache Tika (tika-app-1.3) using this code:

InputStream input = new FileInputStream("Introduction.pdf");  
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(100 * 1024 * 1024);
Metadata metadata = new Metadata();
parser.parse(input, handler, metadata);
System.out.println(handler.toString());

handler.toString() returns the text of the PDF, but the bullets in that text appear as the ? symbol, and I want the bullets to be preserved as they are. Is there a way to get the content exactly as it is in the original using Apache Tika? Or does an encoding need to be specified somewhere in the parsing?

1 answer

[The answer text did not survive extraction. The only recoverable fragment is a reference to the question mark character, ? (U+003F).]
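Since the surviving fragment of the answer points at the question mark character (U+003F), one plausible reading is that the bullets are lost when the text is encoded for output, not when Tika extracts it. That is an assumption, not something the source confirms. The sketch below uses only the JDK (no Tika) to show how a bullet (U+2022) turns into ? under a charset that cannot represent it, and survives under UTF-8. The class name BulletEncodingDemo, the method roundTrip, and the sample string are hypothetical stand-ins for the output of handler.toString().

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use Tika, only the JDK.
class BulletEncodingDemo {

    // Encodes the text with the given charset and decodes it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for characters it cannot represent; for US-ASCII that byte is '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Stand-in for handler.toString(): a line starting with a bullet (U+2022).
        String extracted = "\u2022 First item";

        // A charset without the bullet character turns it into U+003F.
        System.out.println(roundTrip(extracted, StandardCharsets.US_ASCII)); // prints "? First item"

        // UTF-8 can represent the bullet, so the round trip is lossless.
        System.out.println(extracted.equals(roundTrip(extracted, StandardCharsets.UTF_8))); // prints "true"

        // Printing through an explicitly UTF-8 stream avoids relying on
        // the platform default encoding of System.out.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(extracted);
    }
}
```

If the bullet still shows as ? when printed through the UTF-8 stream, the console itself may not be able to display it. Writing the text to a file as UTF-8 and opening it in an editor is one way to check whether the extracted string actually contains U+2022.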


Source: https://habr.com/ru/post/1609012/

