How to create a hash that is similar for similar inputs?

I want to create a database of files, and to find these files easily I want to use some kind of hashing method. However, I not only want to find files that are EXACTLY the same, but also to check whether parts of the files are the same (i.e. the files are similar). In other words, similar files should have similar hashes.

This means that the hash cannot be a cryptographic hash, because there should be no “avalanche effect” (the avalanche effect means that changing even a single bit of the input changes, on average, about half of the output bits).
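To illustrate the avalanche effect, here is a minimal sketch using SHA-256 from Python's standard `hashlib`. The two inputs and the helper name `hamming_distance` are only illustrative choices; the inputs differ in a single bit.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# "d" (0x64) and "e" (0x65) differ in exactly one bit.
digest_1 = hashlib.sha256(b"hello world").digest()
digest_2 = hashlib.sha256(b"hello worle").digest()

changed_bits = hamming_distance(digest_1, digest_2)
print(changed_bits, "of 256 output bits differ")
```

With a cryptographic hash the count should land somewhere around half of the 256 output bits, which is exactly the behavior that makes such hashes useless for similarity comparison.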

Another thing is that the hash does not need to be one-way, since it is not used for security purposes but for file comparison.

So, essentially, I'm looking for an algorithm that creates a unique hash for each unique input and that:

  • Has (almost) no collisions

  • Generates a similar output for similar inputs

  • Produces output shorter than the source file (otherwise it would be easier to simply compare the source files)

I was thinking of something like adding the first two characters together, then adding the 3rd and 4th together, etc. However, this produces a HUGE number of collisions, since "1 + 4" gives the same result as "2 + 3", etc.
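A small sketch of that pairwise-sum idea shows the collision problem. The function name `pair_sum_hash` and the choice to sum UTF-8 byte values are assumptions made for illustration, not part of any existing library.

```python
from typing import List


def pair_sum_hash(text: str) -> List[int]:
    """Sum the byte values of each consecutive pair of characters."""
    data = text.encode("utf-8")
    return [sum(data[i:i + 2]) for i in range(0, len(data), 2)]


# Different inputs, identical "hash": the sum loses both the order of the
# characters and the way the total is split between them.
print(pair_sum_hash("14") == pair_sum_hash("23"))
print(pair_sum_hash("ab") == pair_sum_hash("ba"))
```

Both comparisons are true, so unrelated pairs collide as soon as their values add up to the same total.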

I really don't know how to start. Can someone enlighten me please? :)

+6
2 answers

This is usually called near-duplicate detection, and it is not an easy problem to solve; I would recommend the simhash algorithm (code here).
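For readers who want to see the idea, here is a minimal simhash sketch. It is not the code the answer links to; the word-level tokenization, the use of MD5 as the per-token hash, and the 64-bit width are all assumptions chosen to keep the example short.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Build a fingerprint in which each bit is a majority vote over token hashes."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def bit_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


original = "the quick brown fox jumps over the lazy dog and then runs far away into the dark green forest"
edited = "the quick brown fox jumps over the lazy cat and then runs far away into the dark green forest"
unrelated = "stock markets fell sharply today after central banks raised interest rates again while investors sold shares"

print(bit_distance(simhash(original), simhash(edited)))
print(bit_distance(simhash(original), simhash(unrelated)))
```

Because most tokens are shared, the edited text should differ from the original in far fewer bits than the unrelated text does. Real implementations usually hash overlapping shingles and weight tokens by frequency, which this sketch leaves out.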

+3

I am currently using ssdeep to achieve the same effect, and I am getting pretty good results.

I have also read that sdhash is better than ssdeep.

+1

Source: https://habr.com/ru/post/902463/

