How to add Unicode to a TrueType Type 0 font in PDFBox 2.0.0?

I am using PDFBox version 2.0.0 in a Java project to convert PDF files to text.

Several of my PDFs do not have a ToUnicode map, so the extracted text comes out as gibberish.

2016-09-14 10:44:55 WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont(1):322 - No Unicode mapping for 694 (30) in font MPBAAA+F1

In the warning above, a gibberish character (30) was output instead of the real character.

I managed to work around this by editing the file additional.txt in PDFBox: by trial and error I worked out that the character code (694 in this case) stands for a particular Hebrew letter (צ).

Here is a short example of what I edited inside the file:

-694;05E6 #HexaDecimal value for the letter צ
-695;05E7
-696;05E8
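
To make the entry format concrete, here is a minimal sketch that reads lines of the form `name;HEX`, with an optional `#` comment, into a lookup table. It is not PDFBox's own loader; the class name `GlyphListLines` is hypothetical, and the leading `-` shown in the excerpt above is left out on the assumption that it is not part of the entry.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GlyphListLines {

    /** Parses "name;HEX" lines into name -> Unicode string; text after '#' is ignored. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String raw : lines) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            String[] parts = line.split(";", 2);
            if (parts.length != 2) {
                continue; // blank line or not an entry
            }
            // The right-hand side is a hexadecimal Unicode code point, e.g. 05E6
            int codePoint = Integer.parseInt(parts[1].trim(), 16);
            result.put(parts[0].trim(), new String(Character.toChars(codePoint)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "694;05E6 # Hebrew letter tsadi", "695;05E7", "696;05E8");
        for (Map.Entry<String, String> e : parse(lines).entrySet()) {
            System.out.printf("%s -> U+%04X%n", e.getKey(), e.getValue().codePointAt(0));
        }
    }
}
```

Checking the edited lines this way before rebuilding PDFBox catches a mistyped hex value early.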

Later I came across almost the same warning in a different PDF, but instead of gibberish characters I get no characters at all. A more detailed explanation of this problem can be seen here: pdf reading via pdfbox in java

2016-09-14 11:07:10 WARN org.apache.pdfbox.pdmodel.font.PDType0Font(1):431 - No Unicode mapping for CID+694 (694) in font ABCDEE+Tahoma,Bold

As you can see, this warning comes from a different class (PDType0Font) than the first one (PDSimpleFont), but the code (694) is the same in both, and both refer to the same character.
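
Since both warnings share the same wording, the affected codes and fonts can be collected from the log mechanically. The sketch below assumes the message format is exactly as in the two log lines quoted above; the class name `UnmappedCodeLog` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnmappedCodeLog {

    // Matches both "... for 694 (30) in font X" and "... for CID+694 (694) in font X"
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for (?:CID\\+)?(\\d+) \\(\\d+\\) in font (\\S+)");

    /** Returns {code, font name without subset prefix}, or null if the line is not such a warning. */
    static String[] parse(String logLine) {
        Matcher m = WARNING.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        String font = m.group(2);
        int plus = font.indexOf('+');
        if (plus != -1) {
            font = font.substring(plus + 1); // drop a subset prefix such as "ABCDEE+"
        }
        return new String[] { m.group(1), font };
    }

    public static void main(String[] args) {
        String[] hit = parse("WARN PDType0Font - No Unicode mapping for CID+694 (694) "
                + "in font ABCDEE+Tahoma,Bold");
        System.out.println(hit[0] + " in " + hit[1]);
    }
}
```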

Is there any other file that I have to modify, besides additional.txt, to point the code 694 (the Hebrew letter צ) to the correct Unicode?

Thanks.

1 answer

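You can add a ToUnicode CMap to the font yourself. The example below does this for a single font (its name starts with "Calibri-Bold") and maps only a handful of characters, enough for the word "Bedingungen". The font name and the mappings have to be adapted to your own file.

The code only touches fonts that:

  • use the Identity-H encoding
  • have no ToUnicode entry

    // f is the java.io.File of the PDF being repaired
    try (PDDocument doc = PDDocument.load(f))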
    {
        for (int p = 0; p < doc.getNumberOfPages(); ++p)
        {
            PDPage page = doc.getPage(p);
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                COSBase encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                if (!COSName.IDENTITY_H.equals(encoding))
                {
                    continue;
                }
                // strip the subset prefix (e.g. "ABCDEE+") to get the real font name
                String fname = font.getName();
                int plus = fname.indexOf('+');
                if (plus != -1)
                {
                    fname = fname.substring(plus + 1);
                }
                if (font.getCOSObject().containsKey(COSName.TO_UNICODE))
                {
                    continue;
                }
                System.out.println("File '" + f.getName() + "', page " + (p + 1) + ", " + fontName.getName() + ", " + font.getName());
                if (!fname.startsWith("Calibri-Bold"))
                {
                    continue;
                }
                COSStream toUnicodeStream = new COSStream();
                try (PrintWriter pw = new PrintWriter(toUnicodeStream.createOutputStream(COSName.FLATE_DECODE)))
                {
                    // "9.10 Extraction of Text Content" in the PDF 32000 specification
                    pw.println ("/CIDInit /ProcSet findresource begin\n" +
                            "12 dict begin\n" +
                            "begincmap\n" +
                            "/CIDSystemInfo\n" +
                            "<< /Registry (Adobe)\n" +
                            "/Ordering (UCS) /Supplement 0 >> def\n" +
                            "/CMapName /Adobe-Identity-UCS def\n" +
                            "/CMapType 2 def\n" +
                            "1 begincodespacerange\n" +
                            "<0000> <FFFF>\n" +
                            "endcodespacerange\n" +
                            "10 beginbfchar\n" + // number is count of entries
                            "<0001><0020>\n" + // space
                            "<0002><0041>\n" + // A
                            "<0003><0042>\n" + // B
                            "<0004><0044>\n" + // D
                            "<0013><0065>\n" + // e
                            "<0012><0064>\n" + // d
                            "<0017><0069>\n" + // i
                            "<001B><006E>\n" + // n
                            "<0015><0067>\n" + // g
                            "<0020><0075>\n" + // u
                            "endbfchar\n" +
                            "endcmap CMapName currentdict /CMap defineresource pop end end");
                }
                font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicodeStream);
            }
        }
        doc.save("huhu.pdf");
    }
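
The hard-coded CMap string above must carry the right entry count before `beginbfchar`, which is easy to get wrong when the table grows. As a sketch under stated assumptions, the helper below generates the same CMap text from a map of 2-byte character codes to Unicode strings. The class name `ToUnicodeCMapText` is hypothetical, it uses only the JDK, and it splits entries into blocks of at most 100, which is the usual limit for CMap files. That code 694 is the right code for your font is an assumption taken from the warning in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ToUnicodeCMapText {

    /** Builds ToUnicode CMap text; keys are 2-byte character codes, values the Unicode text. */
    static String build(Map<Integer, String> codeToUnicode) {
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n")
          .append("12 dict begin\n")
          .append("begincmap\n")
          .append("/CIDSystemInfo\n")
          .append("<< /Registry (Adobe)\n")
          .append("/Ordering (UCS) /Supplement 0 >> def\n")
          .append("/CMapName /Adobe-Identity-UCS def\n")
          .append("/CMapType 2 def\n")
          .append("1 begincodespacerange\n")
          .append("<0000> <FFFF>\n")
          .append("endcodespacerange\n");
        // Sort by code and emit blocks of at most 100 entries each
        List<Map.Entry<Integer, String>> entries =
                new ArrayList<>(new TreeMap<>(codeToUnicode).entrySet());
        for (int start = 0; start < entries.size(); start += 100) {
            int end = Math.min(start + 100, entries.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, String> e : entries.subList(start, end)) {
                sb.append(String.format("<%04X><%s>\n", e.getKey(), utf16Hex(e.getValue())));
            }
            sb.append("endbfchar\n");
        }
        sb.append("endcmap CMapName currentdict /CMap defineresource pop end end\n");
        return sb.toString();
    }

    /** Hex digits of the UTF-16BE encoding, e.g. "A" -> "0041". */
    static String utf16Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (char c : s.toCharArray()) {
            hex.append(String.format("%04X", (int) c));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> map = new TreeMap<>();
        map.put(694, "\u05E6"); // codes taken from the question
        map.put(695, "\u05E7");
        map.put(696, "\u05E8");
        System.out.print(build(map));
    }
}
```

In the code above, the returned string would be written to the stream in place of the literal, for example with `pw.print(ToUnicodeCMapText.build(map))`.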
    

Btw, the 2.1 version of PDFDebugger can help you work out the entries for the ToUnicode CMap.

[Screenshot]


Source: https://habr.com/ru/post/1654542/

