Is there any way to determine the similarity of two PDFs without comparing their contents?

I want to measure how similar two PDF files are, but I do not want to compare their content. Is there a way to do this using only their external structure? Is that possible? Thanks!

+3
3 answers

This sounds potentially hard, but here is some low-hanging fruit from the PDF metadata, in order of complexity.

  • Document metadata, such as eBook-title and Title
  • Number of pages in the document (counting /Page directives)
  • Per-page metadata, such as MediaBox, CropBox, BleedBox, TrimBox
  • , , , , .
  • Text: on an uncompressed PDF, strings on Linux will show it as (blah blah blah) Tj, the PDF text-showing operator.
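
The page-count and page-box checks from the list above can be sketched in a few lines of Python. This is only a rough illustration, not a robust parser: it scans the raw bytes with regular expressions, so it assumes the page objects are not stored inside compressed object streams. The function names structural_fingerprint and similarity, and the 50/50 weighting of the score, are made up for this example.

```python
import re

# Matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")
# Captures the numbers of a box entry such as "/MediaBox [0 0 612 792]".
BOX_RE = re.compile(
    rb"/(MediaBox|CropBox|BleedBox|TrimBox)\s*\[\s*([-\d.\s]+?)\s*\]"
)


def structural_fingerprint(data: bytes) -> dict:
    """Summarise page count and page boxes found in raw PDF bytes."""
    boxes = sorted(
        (name.decode("ascii"),
         tuple(float(n) for n in nums.decode("ascii").split()))
        for name, nums in BOX_RE.findall(data)
    )
    return {"pages": len(PAGE_RE.findall(data)), "boxes": boxes}


def similarity(a: dict, b: dict) -> float:
    """Crude score in [0, 1]: half for equal page count, half for box overlap.

    The weighting is an arbitrary assumption made for this sketch.
    """
    score = 0.5 if a["pages"] == b["pages"] else 0.0
    set_a, set_b = set(a["boxes"]), set(b["boxes"])
    union = set_a | set_b
    score += 0.5 * (len(set_a & set_b) / len(union) if union else 1.0)
    return score
```

To use it, read both files in binary mode (for example with open(path, "rb").read()), build a fingerprint for each, and pass the two fingerprints to similarity. A score near 1 only says the page layout matches; it says nothing about the text.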

, , GhostScript , . , , 100px, .

PDF, ! ( ), . PDF HTML PDF.

+3

, , (, md5), .

diff, , , , , , .

pdf. , - , .

0

PDF is not just a text file. It is a binary format made up of numbered objects that are located through a cross-reference table. When object streams are used, objects can also be stored compressed inside other objects, so they are not visible in a raw dump of the file.
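
To illustrate why compressed data is invisible to a plain-text scan, here is a minimal Python sketch that tries to inflate every stream in a PDF with zlib. It assumes the streams use the common FlateDecode filter with no predictor and simply skips anything that does not decompress; the function name inflate_streams is hypothetical, and a real tool should read each stream's /Length and /Filter entries instead of pattern-matching.

```python
import re
import zlib

# Grab the bytes between "stream" and "endstream"; the lookbehind keeps the
# pattern from matching the tail of an "endstream" keyword.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data: bytes) -> list:
    """Return the decompressed body of every zlib-compressed stream."""
    inflated = []
    for raw in STREAM_RE.findall(data):
        try:
            # decompressobj ignores the end-of-line bytes left after the data.
            inflated.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            pass  # not zlib data (uncompressed, or a different filter)
    return inflated
```

Running a structural scan over the inflated streams as well as the raw file gives a more complete picture of a PDF that uses compression.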

If you want to do low-level text manipulation, you really need to use a decent tool. Acrobat 9.0 has a menu item for viewing the internal structure of a PDF, or you can use a library such as iText.

0

Source: https://habr.com/ru/post/1705545/