Decoding algorithm

I regularly receive PDF files whose text is encoded. The encoding shows up as follows:

  • the PDF files display correctly in Acrobat Reader
  • select everything and copy the text in Acrobat Reader
  • paste it into a text editor
  • the pasted content turns out to be encoded

For example:

13579 -> 3579;
hello -> jgnnq

This is basically an offset (possibly with wraparound) of the ASCII character codes.
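A quick way to confirm that both samples use the same offset is to compare character codes pairwise. This is only a sketch: the helper name `offsets` is made up for illustration, and it assumes a plain shift with no wraparound.

```python
def offsets(plain, encoded):
    """Return the set of per-character code differences between two strings."""
    if len(plain) != len(encoded):
        raise ValueError("samples must have the same length")
    return {ord(e) - ord(p) for p, e in zip(plain, encoded)}


# The two samples from the question: plain text on the left, pasted text on the right.
samples = [("13579", "3579;"), ("hello", "jgnnq")]
for plain, encoded in samples:
    print(plain, "->", encoded, offsets(plain, encoded))
```

A single value in each set means every character of that sample was shifted by the same amount; both samples above give 2.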

The question is: how can I automatically find the offset when I have access to several samples? I cannot be sure whether the encoding offset has changed. All I know is some text that usually (if not always) appears inside the PDF, for example "Name:", "Summary:", "Total:".

Thanks!

Edit: thanks for the feedback. I will try to break the question down into smaller questions:

1: (-) ?


The offset is +2 (each character code is shifted by +2 char):

h i j
e f g
l m n
l m n
o p q

1 2 3
3 4 5
5 6 7
7 8 9
9 : ;

For example, trying a range of offsets against known strings:

>>> text = 'jgnnq'
>>> knowns = ['hello', '13579']
>>>
>>> for i in range(-5, 6):  # check char code offsets from -5 to +5
...     rot = ''.join(chr(ord(j) + i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print(rot)
...
hello
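The same brute-force idea can be applied to the markers mentioned in the question ("Name:", "Summary:", "Total:"). The following is a minimal sketch, not a definitive implementation: the names `shift`, `find_offsets` and `max_shift` are hypothetical, the sample page is invented for the demonstration, and it assumes a plain shift of character codes without wraparound.

```python
def shift(text, offset):
    """Shift every character's code point by `offset` (no wraparound)."""
    out = []
    for ch in text:
        code = ord(ch) + offset
        # Characters that would leave the valid code point range stay unchanged.
        out.append(chr(code) if 0 <= code < 0x110000 else ch)
    return "".join(out)


def find_offsets(encoded, markers, max_shift=10):
    """Return (offset, marker) pairs for every marker found after undoing `offset`."""
    hits = []
    for offset in range(-max_shift, max_shift + 1):
        decoded = shift(encoded, -offset)  # undo a candidate encoding offset
        for marker in markers:
            if marker in decoded:
                hits.append((offset, marker))
    return hits


markers = ["Name:", "Summary:", "Total:"]
# Invented sample: simulate a page that was encoded with an offset of +2.
sample = shift("Name: Alice  Total: 42", 2)
print(find_offsets(sample, markers))
```

Running `find_offsets` on each received sample separately would show whether the offset has changed between files; a sample with no hits probably uses a larger offset or a different scheme.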







PDF, Acrobat Reader? , PDF (, PDF Clown) , .


Source: https://habr.com/ru/post/1742771/

