What are the chances that two messages have the same MD5 digest and the same SHA1 digest?

Question

What are the chances that two messages have the same MD5 digest and the same SHA1 digest?

Given two different messages: A and B (maybe 20-80 characters of text, if size matters at all), what is the likelihood that MD5 digest A will be the same as MD5 digest from B and SHA1 digest from A is the same How is SHA1 digest B? I.e:

(MD5(A) == MD5(B)) && (SHA1(A) == SHA1(B))

Avoid malicious intent, i.e. that messages are not selected for collision detection. I just want to know that this happens naturally.

I think the odds are "astronomically low", but I'm not sure how to check it.

Additional information: the size of the pool of possible messages is limited, but large (several hundred million). Situations with the birth paradox are exactly what I'm worried about.

+46

math hash-collision md5 sha1 digest

John Siracusa Aug 24 '09 at 15:17

source share

5 answers

Adding to the Welbog post:

The ratios of large factorials can be calculated without using arithmetic of arbitrary accuracy using the Stirling approximation :

P! & Asymp; sqrt (2? pi) * (n / e) ⁿ

So (S!) / (S ^ N * (S - N)!) & Asymptotics; SQRT (2 S S) / SQRT (2 ((SN)) * (S / e) ^S / ((SN), / e) ^SN / S ^N

= sqrt (S / (SN)) * (S / (SN)) ^SN * e ^-N

= sqrt (1 +?) * (1 +?) ^SN * e ^-N where? = N / (SN) is small.

Approximation (1 + a / n) ^nx & asympt. e ^ax runs as n → & INFIN; (or at least getting very big)

**, therefore it means (1+ (N / (SN))) ^SN & asympt. e ^N for SN → N.

Therefore, I would expect that

(S!) / (S ^ N * (S - N)!) & Asymptotics; sqrt (1 + N / (SN)) * e ^N * e ^-N = sqrt (1 + N / (SN)) for SN → N ....

except that it is more than 1 ... therefore one of the approximations is not enough .: P

(** caveat: N / S should be small: for N = 22, S = 365 this is disabled 2 times)

+6

Jason S Aug 24 '09 at 19:36

source share

If the message size is not limited, the probability approaches 100% asymptotically, since there are an infinite number of possible messages and a finite number of possible hashes.

(note: editing the question makes this less relevant now)

+4

ceejayoz Aug 24 '09 at 15:35

source share

Typically, when one randomly selects N elements, it is easier to calculate the expected number of collisions than the probability of a collision. Since the expected number of collisions cannot be less than the probability of a collision, it can often be used as a suitable upper bound.

Suppose p is the probability of a collision of two randomly selected elements. If we choose N random elements, then there is N * (N-1) / 2 pair of elements and, therefore, the expected number of collisions will be

p * N * (N-1) / 2.

For example, if we assume that the collision probability for both MD5 and SHA1 is p = 2 ^-288, then even after a random selection of ^2,100 elements, we still only expect about 2 ^-89 collisions.

Another example: if you select 2 ³⁰ random elements and calculate only MD5. Assuming a collision between two MD5 hashes is p = 2 ^-128, this gives an expected number of 2 ^-59 for the number of collisions. Therefore, even the probability that the MD5 hash collides for two inputs is already very small.

+1

Accipitridae Aug 26 '09 at 19:15

source share

The selected answer is incorrect because it uses incorrect probabilities. I spent most of the day studying this (you can sort my thought process in the comments on this answer) and assume that the actual answer is the following (for attacking your birthday with a little larger messages than the ones you are talking about)

2 ^ -61 * 2 ^ -18 = collision once every 2 ^ 79.

And if it's just to just multiply these probabilities (I'm not sure about that).

This is feasible (in less than a couple of months and every time reduced) by supercomputers today.

Note that this is based on large enough message pools (to make the birthday paradox meaningful). This is also the scenario you talked about that is bothering you.

Now another situation is collision detection for a pair of hashes (SHA1 and MD5) of a particular message. This takes you out of the realm of the bday paradox and several orders of magnitude. I'm not sure if it is 2 ^ (- 61 * 2) * 2 ^ (- 18 * 2) or something else. If anyone knows what it is, post a comment on this answer (would be appreciated!).

Now you are asking:

Given two different messages: A and B (maybe 20-80 characters of text, if size matters at all)

Yes, size matters. Click on the link to the number 2 ^ -18, and you will see that this value is for two input blocks. In MD5, the input block is 512 bytes. 2080 characters of text are too small for this, and a single-block value is 2 ^ 41.

So for this amount of data you get 2 ^ -61 (I think) * 2 ^ -41 = 2 ^ -102.

Thus, for this size it seems safe (the link contains a two-point hash digit of the SHA256 bitcoin: 46626.93 TH / sec).

+1

Greg Slepak Apr 12

source share

Welbog · Accepted Answer · 2009-08-24 15:27

Assuming uniform distribution in the range of MD5 and SHA-1 hashes for random strings (this is not the case), and suppose we are talking only about two strings and not talking about a string pool (therefore, we avoid birthday-paradoxical complexity):

The MD5 hash is 128 bits wide, and SHA-1 is 160. Given the above assumptions, two lines A and B have a collision probability P if both hashes collide. So,

 P(both collide) = P(MD5 collides) * P(SHA-1 collides)

AND

 P(MD5 collides) = 1/(2^128) P(SHA-1 collides) = 1/(2^160)

So,

 P(both) = 2^-128 * 2^-160 = 2^-288 ~= 2.01 x 10^-87

Again, if you have a row pool and you are trying to determine the probability of collisions with the pool, you are in the domain of the birthday paradox , and this probability, which I calculated here, does not apply. This and the hashes are not as uniform as they should be. In fact, you will have a much higher level of collisions, but it will still be tiny.

EDIT

Since you are dealing with a paradoxical birthday situation, use the same logic as solving the paradoxical birthday. Let's look at this in terms of only one hash function:

 N := the number of hashes in your pool (several hundred million) S := the size of your hash space (2^288) Therefore, P(There are no collisions) = (S!)/(S^N * (S - N)!)

Suppose we have a good even number of hashes, such as 2 ^ 29 (approximately 530 million).

 P = (2^288!)/(2^288^(2^29) * (2^288 - 2^29)!)

In short, I don’t even want to think about calculating this number. I don’t even know how you can evaluate it. At the very least, you need an arbitrary precision calculator that can handle huge factorials without dying.

Note that the probabilities will follow a curve starting at almost 0 for N = 1 or 2 , and it will reach 1 when N >= 2^288 , in form similar to the one on the Wikipedia page for the paradox of the day.

The birthday paradox reaches P = .5 when N = 23 . In other words, the chance of collision is 50% when N is 6% of S. If it scales (I'm not sure it is), it means that the probability of collision will be 50% if you have 6% of 2 ^ 288 hashes. 6% of 2 ^ 288 is around 2,284. Your N value (several hundred million) is nowhere near. This is practically negligible compared to your S, so I don’t think you have anything to worry about. Collisions are not very likely.

What are the chances that two messages have the same MD5 digest and the same SHA1 digest?

More articles: