Copy-paste from a PDF gives gibberish from the source file, but works after printing the PDF with CutePDF

I have a PDF file in Greek. A known problem occurs when you copy and paste text from it: the result is partly gibberish. I say "partly" because, although the pasted result makes no sense in Greek, it consists of real Greek characters. Another interesting aspect of the problem is that only some of the characters come out wrong. For example, compare this piece of the source text

ΕΞ. ΕΠΕΙΓΟΝ – ΑΜΕΣΗ ΕΦΑΡΜΟΓΗ
ΝΑ ΣΤΑΛΕΙ ΚΑΙ ΜΕ Ε-ΜΑIL

with the same text pasted from the PDF:

ΔΞ. ΔΠΔΙΓΟΝ – ΑΜΔ΢Η ΔΦΑΡΜΟΓΗ
ΝΑ ΢ΣΑΛΔΙ ΚΑΙ ΜΔ Δ-ΜΑIL

You will notice that some of the characters are pasted correctly, while others are not. It may also be useful to mention that the incorrect characters are not mapped back reflexively; for example, Ε becomes Δ and vice versa.
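The mismatch can be made concrete by comparing code points position by position. The following is only a minimal sketch: the class name PasteDiff and the method diff are made up for illustration, and the two strings are the first two words of the source and pasted samples (ΕΞ. ΕΠΕΙΓΟΝ and ΔΞ. ΔΠΔΙΓΟΝ), written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library: lists the positions
// where the pasted text differs from the expected text.
class PasteDiff {

    /** Returns one entry such as "U+0395 -> U+0394" per differing position. */
    static List<String> diff(String expected, String pasted) {
        List<String> out = new ArrayList<>();
        int[] a = expected.codePoints().toArray();
        int[] b = pasted.codePoints().toArray();
        int n = Math.min(a.length, b.length); // compare only the common prefix
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                out.add(String.format("U+%04X -> U+%04X", a[i], b[i]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First two words of the source sample, as Unicode escapes.
        String expected = "\u0395\u039E. \u0395\u03A0\u0395\u0399\u0393\u039F\u039D";
        // The same two words as pasted from the PDF.
        String pasted = "\u0394\u039E. \u0394\u03A0\u0394\u0399\u0393\u039F\u039D";
        diff(expected, pasted).forEach(System.out::println);
    }
}
```

For these two words, every differing position is Ε (U+0395) pasted as Δ (U+0394); all other letters match.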

When I open the PDF in, for example, Adobe and print it to a PDF printer, in this case CutePDF, copying and pasting from the output works correctly!

Given the above, my questions are as follows:

  • What is the reason for this behavior?
  • How can I integrate a fix into a Java-based workflow for arbitrary incoming PDF files?

EDIT: multiple typos

1 answer

Some background:

PDF . . . , .

ToUnicode CMap.

For example, the Ε is encoded as:

[0x01FC, ...] TJ

The font's ToUnicode CMap contains:

4 beginbfrange
<01f9> <01fc> <0391>
...
endbfrange

Thus, the codes 0x01F9, 0x01FA, 0x01FB, and 0x01FC are mapped to Unicode U+0391, U+0392, U+0393, and U+0394, respectively.
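The arithmetic behind a bfrange entry can be sketched in a few lines of Java. This is only an illustration, not how any particular PDF library implements it: the class BfRange and its method map are hypothetical names, and the sketch covers only the simple form shown above (low code, high code, first destination code point), not the variant where the destination is an array.

```java
// Hypothetical model of one simple bfrange entry: <lo> <hi> <dstStart>.
class BfRange {
    final int lo;        // first character code in the range
    final int hi;        // last character code in the range
    final int dstStart;  // Unicode code point assigned to lo

    BfRange(int lo, int hi, int dstStart) {
        this.lo = lo;
        this.hi = hi;
        this.dstStart = dstStart;
    }

    /** Returns the Unicode code point for code, or -1 if code is outside the range. */
    int map(int code) {
        if (code < lo || code > hi) {
            return -1;
        }
        // Each successive code maps to the next code point after dstStart.
        return dstStart + (code - lo);
    }

    public static void main(String[] args) {
        BfRange range = new BfRange(0x01F9, 0x01FC, 0x0391);
        for (int code = range.lo; code <= range.hi; code++) {
            System.out.printf("0x%04X -> U+%04X%n", code, range.map(code));
        }
    }
}
```

With the range from the CMap above, code 0x01FC comes out as U+0394, which is the code point a text extractor reports for that glyph.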

U+0394 is Δ, which is what copy/paste produces.

Another code, 0x0204, falls into the ToUnicode range <0200> <020b> <039a> and is therefore mapped to U+039E, which is Ξ.

, , Unicode . , . . .


Source: https://habr.com/ru/post/1609984/

