Create a unique number based on line input in Javascript

In the past, I created a function that generates a unique identifier (number) from a string. Today I find that this is not as straightforward as it should be. Never seen a problem with this. Today, two different inputs generate the same identifier (number).

I use the same technique in Delphi, C ++, PHP and Javascript to generate the same identifier, so it makes no difference when different languages ​​are involved in the project. For example, it can be convenient for communication, for HTML id, tempfiles, etc.

In general, what I am doing is calculating CRC16 rows, adding the sum and returning it.

For example, these two lines generate the same identifier (number):

o.uniqueId( 'M:/Mijn Muziek/Various Artists/Revs & ElBee - Tell It To My Heart.mp3' ); o.uniqueId( 'M:/Mijn Muziek/Various Artists/Dwight Yoakam - The Back Of Your Hand.Mp3'); 

Both of them generate identifier 224904.

The following example is a javascript example. My question is: how can I avoid (with a little change) that it creates a duplicate? (In case you are interested in what β€œo.” Means, This is the object these functions relate to):

 o.getCrc16 = function(s, bSumPos) { if(typeof s !== 'string' || s.length === 0) { return 0; } var crc = 0xFFFF, L = s.length, sum = 0, x = 0, j = 0; for(var i = 0; i < L; i++) { j = s.charCodeAt(i); sum += ((i + 1) * j); x = ((crc >> 8) ^ j) & 0xFF; x ^= x >> 4; crc = ((crc << 8) ^ (x << 12) ^ (x << 5) ^ x) & 0xFFFF; } return crc + ((bSumPos ? 1 : 0) * sum); } o.uniqueId = function(s, bres) { if(s == undefined || typeof s != 'string') { if(!o.___uqidc) { o.___uqidc = 0; } else { ++o.___uqidc; } var od = new Date(), i = s = od.getTime() + '' + o.___uqidc; } else { var i = o.getCrc16(s, true); } return((bres) ? 'res:' : '') + (i + (i ? s.length : 0)); }; 

How to avoid duplicates using a small code change?

+4
source share
2 answers

Well, we checked it all and came to this. A relative short unique identifier created as follows:

 o.lz = function(i,c) { if( typeof c != 'number' || c <= 0 || (typeof i != 'number' && typeof i != 'string') ) { return i; } i+=''; while( i.length < c ) { i='0'+i; } return i; } o.getHashCode = function(s) { var hash=0,c=(typeof s == 'string')?s.length:0,i=0; while(i<c) { hash = ((hash<<5)-hash)+s.charCodeAt(i++); //hash = hash & hash; // Convert to 32bit integer } return ( hash < 0 )?((hash*-1)+0xFFFFFFFF):hash; // convert to unsigned }; o.uniqueId = function( s, bres ) { if( s == undefined || typeof s != 'string' ) { if( !o.___uqidc ) { o.___uqidc=0; } else { ++o.___uqidc; } var od = new Date(), i = s = od.getTime()+''+o.___uqidc; } else { var i = o.getHashCode( s ); } return ((bres)?'res:':'')+i.toString(32)+'-'+o.lz((s.length*4).toString(16),3); }; 

Examples:

 o.uniqueId( 'M:/Mijn Muziek/Various Artists/Revs & ElBee - Tell It To My Heart.mp3' ); o.uniqueId( 'M:/Mijn Muziek/Various Artists/Dwight Yoakam - The Back Of Your Hand.Mp3'); 

Will create the following identifiers:

 dh8qi9t-114 je38ugg-120 

For my purpose, this seems quite unique, and the extra length adds even more uniqueness. Test it on a file system with approximately 40,000 mp3 files and did not detect any collision.

If you think this is not the way, let me know.

+4
source

You must increase the number of bits generated by your hash function. Assuming your hash function is approximately uniform in space, you can mathematically obtain the probability of observing a collision.

This is strongly associated with a paradoxical day. In the case of CRC16, where the hash is 17 bits (although your implementation may have an error, I don’t see how you got 224094 , as it is more than 2^17 ), the chance of a collision is above 50% when stored more than about 2 ^ 8 items. In addition, CRC is not really a great hash function, as it means error detection rather than uniform hashing.

This table shows the mathematical probabilities of collision based on the hash length . For example, if you have a 128-bit hash key, you can save up to 10^31 elements before the chance of a collision exceeds 10^-15 . For comparison, this probability is lower than that of your hard drive, or maybe lightning is clogged on your computer, so it is a safe number to use.

Just increase your hash length depending on the number of lines you plan to identify, and choose a collision probability that suits you.

0
source

Source: https://habr.com/ru/post/1468883/


All Articles