Storing C# GetHashCode() in DB is unreliable

Possible duplicate:
How to create a HashCode in .net (C#) for a string that is safe to store in a database?

I plan to store hundreds of thousands of URLs in my database. Every row in my UrlInfo table will be immutable, since the URL itself is the logical primary key. Since URLs can be quite long, I decided to hash the URL as a quick way to find possible matches when adding new rows. The hash is not my true key, just a way to quickly find possible matches. In addition, I use a RegEx pattern per domain that reduces the URL to its essence, something that can be compared with other URLs. I also store the RegEx result as a hash, and I don't care if it yields possible duplicates.

Everything went fine until I found out that the C# string.GetHashCode() method I was using for hashing is not guaranteed to return the same value across .NET implementations. I noticed this when I tried to move the hash function from ASP.NET to SQL Server CLR code. The web application uses .NET 4.0, while SQL Server 2008 R2, as I found out, uses .NET 3.5. They gave different hash results for the same string, so now I need to get away from string.GetHashCode(), because I don't want to worry about this when upgrading my application to future versions of .NET.

So the questions are:

  • Does my architecture smell, since I store a hash in my DB? Are there any better ways? Obviously, Microsoft does not want me to store hash results!

  • Can someone recommend a good C# replacement algorithm for hashing strings? I saw Jon's here, but I'm not quite sure how to modify it to work for strings (loop through each char using ASCII codes?).

  • Are there string compression algorithms that are better than using a hash algorithm?

Thanks

LOTS OF AMAZING ANSWERS. THANK YOU VERY MUCH!!!

+4
5 answers

Instead, you can always use an MD5 hash, which is relatively fast:

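    using System.Security.Cryptography;
    using System.Text;

    // Returns the MD5 digest of the URL as a 32-character uppercase hex string.
    public string GetUrlHash(string url)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hash.Length; i++)
            {
                sb.Append(hash[i].ToString("X2")); // two hex digits per byte
            }
            return sb.ToString();
        }
    }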

Call it like this:

 Console.WriteLine(this.GetUrlHash("http://stackoverflow.com/questions/5355003/storing-c-gethashcode-in-db-is-unreliable")); 

And we get:

 > 777BED7F83C66DAC111977067B4B4385 

This should be reliable enough in terms of uniqueness. MD5 is no longer considered safe for security-sensitive uses such as password hashing, but you do not have that problem here.

The only problem is that using a string like this as the primary key in a table can be problematic in terms of performance.
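If the string key is a worry, one option is to keep only part of the digest as a number. Here is a minimal sketch, assuming a BIGINT column for the hash and a hypothetical helper named UrlHash64 (neither comes from the question). It takes the first 8 bytes of the MD5 digest as a 64-bit integer. Since the hash is only used to find candidate matches, the shortened value should be acceptable as long as the full URL is still compared afterwards.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlHash64
{
    // Hypothetical helper: folds the MD5 digest of a URL into a 64-bit value
    // meant for an indexed BIGINT column. Collisions are possible, so the
    // full URL must still be compared on any match.
    public static long Compute(string url)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url));

            // Assemble the first 8 bytes by hand (big-endian) so the result
            // does not depend on the byte order of the machine.
            ulong value = 0;
            for (int i = 0; i < 8; i++)
            {
                value = (value << 8) | digest[i];
            }
            return unchecked((long)value);
        }
    }
}
```

The result depends only on the UTF-8 bytes of the URL and on MD5 itself, not on the .NET version, so it should come out the same in the web application and in SQL CLR code.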

Another thing you can do is use the URL shortening approach: use the database's sequence generation feature and convert the value (make sure you use a LONG or BIGINT equivalent!) to something like Base36, which gives you a nice, short string.
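The Base36 conversion could look roughly like this. It is only a sketch: the class name Base36 and the lowercase digit alphabet are my own choices, and it assumes the sequence only produces non-negative values.

```csharp
using System;
using System.Text;

public static class Base36
{
    // Assumed alphabet: digits first, then lowercase letters.
    private const string Digits = "0123456789abcdefghijklmnopqrstuvwxyz";

    // Converts a non-negative sequence value (BIGINT maps to long) to Base36.
    public static string Encode(long value)
    {
        if (value < 0)
        {
            throw new ArgumentOutOfRangeException("value", "Sequence values are assumed to be non-negative.");
        }
        if (value == 0)
        {
            return "0";
        }

        StringBuilder sb = new StringBuilder();
        while (value > 0)
        {
            // Peel off the least significant Base36 digit and prepend it.
            sb.Insert(0, Digits[(int)(value % 36)]);
            value /= 36;
        }
        return sb.ToString();
    }
}
```

For example, Base36.Encode(36) gives "10" and Base36.Encode(1295) gives "zz". Even the largest long value fits in 13 characters.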

+2

A similar question is asked here:

How to create a HashCode in .net (C#) for a string that is safe to store in a database?

This can help solve your problem.
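If you want a plain int like GetHashCode() returns, but one that stays the same everywhere, you can write the hash yourself. Below is a sketch using FNV-1a (32-bit) over the UTF-8 bytes of the string. This is my own illustration, not necessarily what the linked question recommends, and the names StableHash and Fnv1a32 are made up.

```csharp
using System.Text;

public static class StableHash
{
    // FNV-1a, 32-bit. The two constants are the standard FNV offset basis
    // and FNV prime for 32-bit hashes.
    public static int Fnv1a32(string text)
    {
        const uint OffsetBasis = 2166136261;
        const uint Prime = 16777619;

        uint hash = OffsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            unchecked
            {
                hash ^= b;      // mix in the next byte
                hash *= Prime;  // multiply, wrapping on overflow
            }
        }
        return unchecked((int)hash);
    }
}
```

Because the algorithm is spelled out in your own code and works on bytes rather than on the runtime's internal string layout, the value should not change between .NET versions. It is still only 32 bits, so expect collisions at hundreds of thousands of rows and always compare the full URL on a match.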

+1

As a side note, SQL Server 2008 has the HASHBYTES function, which, given some data (such as a string), can generate an MD2, MD4, MD5, SHA, or SHA1 hash.

+1

I would say that you probably don't need to store the hash.

Just make sure the URL column in your table is indexed properly (unique index), and lookups should be fast.

0

Have you considered compressing the string and storing it as VARBINARY? It can be much smaller, and you can create an index directly on it.
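A rough sketch of what that could look like with GZipStream from the framework. The class name UrlCompression is made up, and GZip is just one possible choice of compressor.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class UrlCompression
{
    // Compresses the UTF-8 bytes of the URL; the result can go into a VARBINARY column.
    public static byte[] Compress(string url)
    {
        byte[] raw = Encoding.UTF8.GetBytes(url);
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream flushes the final compressed block

            return output.ToArray(); // ToArray still works after the stream is closed
        }
    }

    // Restores the original URL from the stored bytes.
    public static string Decompress(byte[] data)
    {
        using (MemoryStream input = new MemoryStream(data))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Two caveats. GZip adds a fixed header and trailer, so short URLs may not shrink at all. Also, the compressed bytes may differ between compression implementations or framework versions, so an index on the compressed value could run into the same kind of mismatch as GetHashCode(). Measure on real URLs before committing to this.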

0

Source: https://habr.com/ru/post/1344301/

