I have an archive of about 100 million binary files. New files are regularly added. File sizes range from about 0.1 MB to about 800 MB.
I can easily find files that are almost certainly identical by comparing their sizes and, when the sizes match, comparing their file hashes.
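For reference, the exact-duplicate check described above can be sketched roughly as follows. This is only a minimal illustration, not my actual tooling: the function names `file_sha256` and `find_exact_duplicates` are made up for this example, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, block_size=1 << 20):
    """Hash a file in 1 MiB blocks so large files never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def find_exact_duplicates(paths):
    """Return groups of paths whose size and SHA-256 digest both match."""
    # Step 1: bucket by size, which is cheap and needs no file reads.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have an identical twin
        # Step 2: hash only the files that share a size with another file.
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[file_sha256(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Files with a unique size are never read, which matters at this scale. This only detects files that are byte-for-byte identical, which is why it does not help with the question below.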
I want to find files with partially similar content. By this I mean files that share some identical parts, while other parts can differ.
What is the best, or at least a realistic, way to find files that are similar to other files and, if possible, to get some measure of how similar they are?
Edit:
The files are mostly executables. I would consider two files similar if, say, somewhere between 10% and 100% of one file's contents match the contents of the other. The lower limit could also be set to 50%; its exact value is not important. I suppose some form of hashing will be needed to make such a comparison feasible on an archive of this size.
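To make the "10% to 100% of the contents match" notion concrete, here is a rough sketch of one possible way to measure it for a single pair of files. It is an assumption about how the measure could be defined, not a solution for 100 million files. It splits each file into content-defined chunks using a Gear-style rolling hash, so that inserted bytes do not shift every later chunk, and reports the fraction of one file's distinct chunks that also occur in the other. The names `chunk_hashes` and `containment`, and the parameters `mask` and `min_size`, are hypothetical.

```python
import hashlib
import random

# Fixed pseudo-random table for the Gear-style rolling hash (seeded so that
# chunk boundaries are reproducible across runs).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk_hashes(data, mask=0x3FF, min_size=64):
    """Cut data at content-defined boundaries; return the set of chunk digests.

    With a 32-bit hash that shifts left one bit per byte, the hash value
    depends only on the last 32 bytes, so boundaries follow the content
    rather than the absolute offset in the file.
    """
    hashes = set()
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if i + 1 - start >= min_size and (h & mask) == 0:
            hashes.add(hashlib.sha1(data[start:i + 1]).digest())
            start = i + 1
            h = 0
    if start < len(data):
        hashes.add(hashlib.sha1(data[start:]).digest())  # trailing chunk
    return hashes


def containment(a, b):
    """Fraction of a's distinct chunks that also occur somewhere in b."""
    ca, cb = chunk_hashes(a), chunk_hashes(b)
    return len(ca & cb) / len(ca) if ca else 0.0
```

For example, `containment(a, b) >= 0.5` would correspond to the 50% lower limit mentioned above. Note that this counts distinct chunks rather than bytes, and that comparing every pair of files this way is not feasible on the whole archive; some indexing or sketching of the chunk hashes would still be needed.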
Tomsv