How much data (many MB) can I uniquely identify with MD5?

I have millions of data records, each about 2 MB in size. Each record is stored in a file, and each has a set of other data associated with it (stored in a database).

When my program runs, it will be presented, in memory, with one of the data records and will need to produce the associated data. To do this, I plan to take an MD5 hash of the in-memory record and use that hash as a key in the database. The key will help me find the other data.
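As a rough sketch of the lookup I have in mind: the function name `record_key` and the dictionary standing in for the database table are placeholders for illustration only, not part of my actual code.

```python
import hashlib


def record_key(record: bytes) -> str:
    """Return the MD5 digest of a record as a 32-character hex string."""
    return hashlib.md5(record).hexdigest()


# Placeholder for the database: maps MD5 hex digest -> associated data.
associated_data = {}

# A dummy 2 MB record standing in for one of the real files.
record = b"\x00" * (2 * 1024 * 1024)

# Store the associated data under the hash of the record contents.
associated_data[record_key(record)] = {"note": "example associated data"}

# Later, given the same record in memory, recompute the hash to find the data.
found = associated_data[record_key(record)]
print(found["note"])
```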

What I need to know is whether an MD5 hash of the data contents is a suitable way to uniquely identify a 2 MB piece of data. That is, can I use an MD5 hash without worrying too much about collisions?

I understand that there is a chance of a collision. My concern is how likely a collision is among millions of 2 MB data records. Is a collision a likely occurrence? How does it compare to a hard drive failure or other computer failures? How much data can MD5 safely be used to identify? What about millions of GB-sized files?

I am not worried about malice or data tampering. I have protections in place, so I will not be receiving manipulated data.

1 answer

It comes down to the so-called birthday paradox. The Wikipedia page on it has simplified formulas for estimating the collision probability. The result will be a very small number.

For millions of records and a 128-bit hash, the probability is far below 10^-12.
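To put a number on it, here is a minimal sketch of the usual birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^b)), for n records and a b-bit hash. It assumes MD5 output behaves like a uniformly random 128-bit value on non-adversarial input, which matches the no-tampering condition in the question. The function name `collision_probability` is made up for this example.

```python
import math


def collision_probability(num_records: int, hash_bits: int = 128) -> float:
    """Approximate chance that at least two of num_records hashes collide,
    assuming each hash is uniformly random over 2**hash_bits values."""
    space = 2.0 ** hash_bits
    exponent = -num_records * (num_records - 1) / (2.0 * space)
    # -expm1(x) equals 1 - exp(x) and stays accurate when x is tiny.
    return -math.expm1(exponent)


for n in (10**6, 10**7, 10**9):
    print(f"{n:>13,} records: p = {collision_probability(n):.3e}")
```

For one million records this comes out at roughly 1.5e-27. Note that the size of each record does not enter the formula; only the number of records and the hash width matter.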


Source: https://habr.com/ru/post/1778689/

