Find Duplicate PDF Files

I am looking for a utility that will help me find duplicate PDF files. Problem: I have 1000 PDF files, and some of them are duplicates. They are not easy to detect, because the file names differ and the file sizes differ slightly. Is there a utility, algorithm, or library that can help me find duplicates, or show me which files are very similar (or their degree of difference)?

+3
5 answers

If the files were created by different tools, they may look the same when viewed but compare as very different files, because they are structured internally in completely different ways. I made some suggestions in a blog article: https://blog.idrsolutions.com/2010/09/comparing-2-pdf-files/

+2

DiffPDF looks like something that can help you.

+1

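Compute the MD5 checksum of each file and compare the checksums. Files with identical checksums are exact duplicates.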
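
A minimal Python sketch of the MD5 idea, using only the standard library. The function names `md5_of_file` and `find_exact_duplicates` are made up for illustration. Note that this only catches byte-identical copies; two PDFs of the same document saved by different tools will hash differently.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Group the *.pdf files under `folder` by MD5 digest.

    Returns a list of groups; each group is a list of paths whose
    contents are byte-for-byte identical (two or more files per group).
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*.pdf")):
        if path.is_file():
            groups[md5_of_file(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates(".")` on a folder returns one list per set of identical files, so anything that appears in a group is safe to treat as an exact copy of the others in that group.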

+1

On UNIX, you could convert the files to text with pdf2txt (see also poppler-utils) and then compare the text output with diff.
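
A rough Python sketch of the convert-then-compare approach. It assumes the `pdftotext` command from poppler-utils is installed and on the PATH (swap in whichever converter you actually have), and it uses `difflib` from the standard library in place of the `diff` command so that it can report a degree of similarity. The function names are hypothetical.

```python
import difflib
import subprocess


def extract_text(pdf_path):
    """Extract text by running pdftotext; the '-' argument sends output to stdout.

    Assumption: poppler-utils' pdftotext is available on this machine.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_similarity(text_a, text_b):
    """Return a similarity ratio between 0.0 and 1.0, compared word by word.

    Splitting on whitespace ignores differences in line breaks and spacing,
    which often vary between PDF producers.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()


def pdf_similarity(path_a, path_b):
    """Compare two PDF files by their extracted text."""
    return text_similarity(extract_text(path_a), extract_text(path_b))
```

A ratio close to 1.0 suggests the two files carry the same text even if their bytes differ. Comparing every pair among 1000 files means roughly half a million comparisons, so it may be worth hashing the normalized text first and only running the slower comparison on files that are not already exact text matches.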

+1

For exact duplicates there is, for example, fdupes: http://premium.caribe.net/~adrian2/fdupes.html

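There is also a Perl script, fileindex: http://seegras.discordia.ch/Programs/fileindex . It keeps an index of MD5 sums in ~/.fileindex.md5, so once the PDF files have been indexed with fileindex, files with the same MD5 sum can be identified as duplicates.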

There are also exif-meta and exif-rename at http://seegras.discordia.ch/Programs/ , which help with setting PDF metadata and renaming PDF files according to that metadata. If you tag all the files correctly, duplicate file names will then point to files that may be the same document stored in a different file.

+1

Source: https://habr.com/ru/post/1767760/

