I have millions of data records, each about 2 MB in size. Each of these pieces of data is stored in a file, and there is a set of other data associated with each record (stored in a database).
When my program runs, I will be presented, in memory, with one of the data records and will need to produce the associated data. To do this, I plan to take the MD5 of the in-memory data and then use this hash as a key into the database. The key will help me find the other data.
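A minimal sketch of the lookup described above, assuming Python and a plain dictionary standing in for the database. The names `record_key` and `associated` are hypothetical and only illustrate the idea; they are not part of any real schema.

```python
import hashlib


def record_key(data: bytes) -> str:
    """Return the MD5 digest of a record as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for the database table: MD5 hex digest -> associated data.
associated = {}

# A dummy 2 MB record, used only for illustration.
blob = b"\x00" * (2 * 1024 * 1024)

# Store the associated data under the record's hash ...
associated[record_key(blob)] = {"note": "example associated data"}

# ... and later find it again from the in-memory record alone.
found = associated[record_key(blob)]
```

In a real database the digest could equally be stored as 16 raw bytes (`hashlib.md5(data).digest()`) instead of the hex string.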
What I need to know is whether the MD5 hash of the data content is a suitable way to uniquely identify a 2 MB piece of data. That is, can I use the MD5 hash without worrying too much about collisions?
I understand that there is a chance of a collision; my concern is how likely a collision is across millions of 2 MB data records. Is a collision a likely occurrence? How does it compare to a hard drive failure or other computer failures? How much data can MD5 be used to safely identify? What about millions of GB-sized files?
I am not worried about malice or data tampering. I have protections in place, so I will not be receiving manipulated data.
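As a rough back-of-the-envelope check, the usual birthday approximation can be evaluated. This assumes MD5 behaves like an ideal random 128-bit function on non-adversarial input, which is an assumption rather than something established here. Under that assumption the estimate depends only on the number of records, not on their size. `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday approximation: p ~= 1 - exp(-n*(n-1) / (2 * 2**bits)).

    Assumes the hash output is uniformly random over 2**bits values
    (an idealization of MD5, valid only for non-adversarial input).
    """
    exponent = n * (n - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


for n in (10**6, 10**9, 10**12):
    print(f"{n:>16,d} records -> p ~ {collision_probability(n):.3e}")
```

The loop prints the estimate for a million, a billion, and a trillion records, so the figures can be set against whatever hardware failure rate is considered acceptable.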