I am using PDFBox version 2.0.0 in a Java project to convert PDF files to text.
Several of my PDFs have fonts with no ToUnicode mapping, so the extracted text comes out as gibberish.
2016-09-14 10:44:55 WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont(1):322 - No Unicode mapping for 694 (30) in font MPBAAA+F1
In the warning above, a gibberish Unicode value (30) was emitted instead of the real character.
I managed to work around this by editing the file additional.txt in PDFBox, since by trial and error I worked out that the character code (694 in this case) corresponds to a particular Hebrew letter (צ).
Here is a short example of what I edited inside the file:
-694;05E6 #HexaDecimal value for the letter צ
-695;05E7
-696;05E8
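To double-check which characters entries like these point to, here is a minimal sketch that decodes lines of the form `name;hex`, with `#` starting a comment. It is not PDFBox's own loader; the class name `GlyphListSketch` and the method `parse` are hypothetical, and it assumes the leading `-` in the excerpt above is list formatting rather than part of the entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: decodes "name;hex" lines for inspection.
class GlyphListSketch {

    // Returns a map from the name field to the character(s) its hex field encodes.
    static Map<String, String> parse(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String raw : text.split("\\R")) {
            // Drop anything after '#', which is treated as a comment here.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split(";");
            if (fields.length != 2) {
                continue; // skip lines that are not "name;hex"
            }
            // The hex field may hold several space-separated code points.
            StringBuilder chars = new StringBuilder();
            for (String hex : fields[1].trim().split("\\s+")) {
                chars.appendCodePoint(Integer.parseInt(hex, 16));
            }
            result.put(fields[0].trim(), chars.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "694;05E6 #HexaDecimal value for the letter\n695;05E7\n696;05E8\n";
        parse(sample).forEach((name, chars) -> System.out.println(name + " -> " + chars));
    }
}
```

Running `main` prints each code next to its decoded character, so a mistyped hex value shows up before the edited file is tried against a real PDF.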
Later I came across almost the same warning with a different kind of PDF, but instead of gibberish characters I get no characters at all. A more detailed explanation of this problem can be seen here: pdf reading via pdfbox in java
2016-09-14 11:07:10 WARN org.apache.pdfbox.pdmodel.font.PDType0Font(1):431 - No Unicode mapping for CID+694 (694) in font ABCDEE+Tahoma,Bold
As you can see, this warning comes from a different class (PDType0Font) than the first one (PDSimpleFont), but the character code (694) is the same in both, and both refer to the same character.
Is there any other file, besides additional.txt, that I have to modify to point the code 694 (the Hebrew letter צ) to the correct Unicode value?
Thanks

