Since you tagged this md5, I'll use that as the example. From Wikipedia:
if two prefixes with the same hash can be constructed, a common suffix can be added to both to make a collision more likely to be accepted as valid data by the application using it. Furthermore, current collision-finding techniques allow an arbitrary prefix to be specified: an attacker can construct two colliding files that both begin with the same content. All the attacker needs in order to generate two colliding files is a template file with a 128-byte block of data, aligned on a 64-byte boundary, which the collision-finding algorithm can change freely. For example, an MD5 collision where the two messages differ by 6 bits:
And the example they then give is 256 bytes long. Since the collision attack relies on a 128-byte data block, and the hash digest is only 128 bits, the elevated collision risk from this attack does not carry past the first iteration (after the first hash, the input to every later round is only 16 bytes, far too small for the attack's 128-byte block). In other words, you can't really influence the probability of a collision beyond the first hash.
Also note that the hash's entropy is the aforementioned 128 bits. Even taking the overall collision probability to be as high as 1 in 2^20.96 (again from Wikipedia), it would take a great many iterations before two inputs collided. The first thing I think you're falling victim to is this line of reasoning:
- Any two arbitrary inputs have some collision probability, x%.
- The outputs of the first hash round are two such inputs.
- Therefore, each iteration increases the collision probability by x%.
This can be quite easily refuted by a counterexample. Consider MD5 again:
- The probability of a collision between any two inputs is 1 in 2^21 (taking the worst case from MD5's cryptanalysis).
- Hashing again gives an equally likely chance of collision, so the probability of a collision by the second round is 1 in 2^20.
- Thus, any two inputs hashed a number of times equal to the digest's entropy are guaranteed to collide.
Take any two inputs and MD5 them 128 times in a row, and you'll see this isn't true. You almost certainly won't find a single duplicate hash among them: after all, you've produced only 256 of the 2^128 possible hash values, a mere 1-in-2^120 fraction of the output space. The collision probability at each round is independent of all the other rounds.
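Here is a minimal sketch of that experiment in Python, using only the standard library's `hashlib`; the two starting inputs and the helper name `md5_chain` are just placeholders:

```python
import hashlib

def md5_chain(data: bytes, rounds: int) -> list[bytes]:
    """Repeatedly MD5 the running digest, keeping every intermediate value."""
    digests = []
    current = data
    for _ in range(rounds):
        current = hashlib.md5(current).digest()
        digests.append(current)
    return digests

# Two arbitrary starting inputs, each hashed 128 times in a row.
chain_a = md5_chain(b"input A", 128)
chain_b = md5_chain(b"input B", 128)

all_digests = chain_a + chain_b
print("digests produced:", len(all_digests))       # 256
print("distinct digests:", len(set(all_digests)))  # almost certainly also 256
```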
There are two ways to understand why this is so. First, each iteration is essentially trying to hit a moving target. I think you could build a birthday-paradox argument that after a surprisingly small number of hash iterations you are likely to see some digest from one input's chain match some digest from the other input's chain, but those matches will almost certainly occur at different iteration counts. And once that happens, the two chains can never produce the same output at the same iteration, because the hash algorithm itself is deterministic: from the meeting point onward, one chain simply replays the other's values at a fixed offset.
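You can make this visible at small scale with a deliberately weakened hash. The sketch below is purely illustrative: the 16-bit truncation of MD5 (the made-up helper `tiny_hash`) and the chain length of 1,000 are arbitrary choices that make cross-chain matches common enough to observe, and then show the fixed-offset behaviour after the chains meet.

```python
import hashlib

def tiny_hash(data: bytes) -> bytes:
    """Toy 16-bit hash: the first 2 bytes of MD5, so collisions are easy to observe."""
    return hashlib.md5(data).digest()[:2]

def chain(seed: bytes, rounds: int) -> list[bytes]:
    values, current = [], seed
    for _ in range(rounds):
        current = tiny_hash(current)
        values.append(current)
    return values

chain_a = chain(b"input A", 1000)
chain_b = chain(b"input B", 1000)

# Find a value that shows up in both chains (a cross-chain match).
common = set(chain_a) & set(chain_b)
if common:
    v = next(iter(common))
    i, j = chain_a.index(v), chain_b.index(v)
    print(f"chains meet at value {v.hex()}: iteration {i} of A, iteration {j} of B")
    # From the meeting point on, determinism forces chain B to replay
    # chain A's values at a fixed offset.
    k = min(len(chain_a) - i, len(chain_b) - j) - 1
    print("still aligned", k, "steps later:", chain_a[i + k] == chain_b[j + k])
else:
    print("no cross-chain match in this run")
```

With a real 128-bit digest the same structure holds; the meeting point is just astronomically far away.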
The other way is to recognize that the hash function effectively adds entropy as it runs. Consider that the empty string has a 128-bit digest just like any other input; that could not happen without bits being introduced during the algorithm's steps. This is actually a necessary part of a cryptographic hash function: data must be destroyed, otherwise the input could be recovered from the digest. For inputs longer than the digest, yes, entropy is lost overall; it has to be, in order to fit within the digest length. But entropy is also added along the way.
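A quick way to see the fixed-width point (the sample inputs here are arbitrary):

```python
import hashlib

for message in (b"", b"a", b"x" * 10_000):
    digest = hashlib.md5(message).digest()
    # Every input, including the empty string, yields a full 16-byte (128-bit) digest.
    print(f"{len(message):6d}-byte input -> {len(digest)}-byte digest: {digest.hex()}")
```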
I don't have exact figures for other hash algorithms, but I think all the points I've made generalize to other hash functions and one-way mapping functions in general.