C# hash function faster than MD5 or SHA256?

I am trying to find something potentially faster than SHA256. I have over a billion records that I need to hash and check for uniqueness. I am currently running them through MD5, which seems pretty fast, and then through SHA256 to avoid collisions. Running them in that order seems to give me a slight performance boost, but I still need it faster. I am looking for names or examples of hash functions implemented in C#, or some kind of pseudocode that I can re-implement in C#.

+4
7 answers

There is a lot of dubious information in the other answers here. You tagged your question with cryptography and only mention cryptographic hash functions, but it sounds like you don't actually need cryptographic protection, in particular because you say:

I have over a billion records that I need to hash and check for uniqueness.

A cryptographic hash function has four properties:

  • it is easy to compute the hash value for any given message
  • it is infeasible to generate a message with a given hash
  • it is infeasible to modify a message without changing its hash
  • it is infeasible to find two different messages with the same hash

You really only care about the first property; uniqueness is a far weaker requirement, only loosely related to the other three properties of cryptographic security.

Why should that matter to you?

Cryptographic security comes with overhead. You don't need it, and you do care about speed, so why not skip it? The output widths of MD5 and the SHA family are admittedly more than wide enough for your purposes.

Check out the list of hash functions on Wikipedia, or read the article on general-purpose (non-cryptographic) hash functions. Beyond that, what is wrong with the built-in .NET hashing facilities? Have you tried simply deferring to the Object.GetHashCode() method? This MSDN link has a lot to say about the proper use of hash codes. You don't say much about the data you are hashing, so it is hard to tell whether the result would be unique across your objects. How do you feed an object into the MD5 hash? I assume you take a binary representation; a similar approach works for the built-in non-cryptographic hash functions.
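For instance, here is a minimal sketch of overriding GetHashCode() over a record's fields; the Record type and its fields are hypothetical, just to illustrate hashing a binary payload with the built-in, non-cryptographic machinery:

```csharp
using System;

// Hypothetical record type; a sketch of leaning on the built-in,
// non-cryptographic hashing instead of MD5/SHA256.
class Record
{
    public string Name;
    public long Size;
    public byte[] Payload;

    public override int GetHashCode()
    {
        // Combine the cheap fields and the payload bytes into one int.
        // unchecked: let the multiply-add arithmetic wrap around.
        unchecked
        {
            int h = 17;
            h = h * 31 + (Name?.GetHashCode() ?? 0);
            h = h * 31 + Size.GetHashCode();
            if (Payload != null)
                foreach (byte b in Payload)
                    h = h * 31 + b;
            return h;
        }
    }
}
```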

You may be concerned about the uniqueness of the built-in hash functions. They return only a plain int, which spans 2^32 values, only about four times the size of the dataset you are working with. However, you always need a backup plan when using hash functions: collisions are unlikely, not impossible. The standard fallback is a more expensive comparison, usually a reference comparison followed by a field-by-field comparison.

If you are not prepared to do an exact comparison when two hash outputs match, you are essentially betting that you will never hit a false positive. That may not be a big problem for you; only you can judge what is at stake.

Also, running a second hash function over the colliding entries is probably not much faster than a direct comparison. You are better off on all counts going with the sure thing and doing the long, direct comparison.
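As a sketch, here is what that fallback looks like if the records are byte arrays; the cheap 32-bit hash below is a placeholder for whichever fast function you settle on:

```csharp
using System;
using System.Collections.Generic;

static class Dedup
{
    // Returns true if 'record' was not seen before; false if it is a duplicate.
    // 'seen' maps a cheap 32-bit hash to every record that produced it, so a
    // hash match alone is never trusted: the full comparison is the fallback.
    public static bool Add(Dictionary<int, List<byte[]>> seen, byte[] record)
    {
        int h = CheapHash(record);
        if (!seen.TryGetValue(h, out var bucket))
        {
            seen[h] = new List<byte[]> { record };
            return true;
        }
        foreach (var candidate in bucket)
            if (candidate.AsSpan().SequenceEqual(record)) // bails at first difference
                return false;
        bucket.Add(record);
        return true;
    }

    // Placeholder hash; swap in any fast 32-bit function.
    static int CheapHash(byte[] data)
    {
        unchecked
        {
            int h = 17;
            foreach (byte b in data) h = h * 31 + b;
            return h;
        }
    }
}
```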

Another common collision-avoidance technique is to use multiple keys. If your data points have several large subcomponents, you hash and compare each of them separately. If they have some large components and some small ones (say, a few simple numeric types), you hash the large ones and compare the small ones directly. If they carry data that is cheap to compare outright (for example, the lengths of strings or the sizes of containers), you can compare those bits directly first.
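A sketch of that staged comparison, with a hypothetical DataPoint type standing in for your records:

```csharp
using System;

// Hypothetical record with small, cheap-to-compare parts and one large blob;
// the cheap checks run first, the hash of the big part next, and the real
// comparison only when everything else matches.
class DataPoint
{
    public int TypeCode;     // small: compare directly
    public string Text;      // its length is a cheap first check
    public byte[] BigBlob;   // large: compare precomputed hashes first
    public int BlobHash;     // hash of BigBlob, computed once up front

    public static bool Equal(DataPoint a, DataPoint b)
    {
        if (a.TypeCode != b.TypeCode) return false;        // direct compare
        if (a.Text.Length != b.Text.Length) return false;  // cheap ordinal
        if (a.BlobHash != b.BlobHash) return false;        // hash of the big part
        // All hashes and cheap fields matched: do the real comparison.
        return a.Text == b.Text
            && a.BigBlob.AsSpan().SequenceEqual(b.BigBlob);
    }
}
```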

If none of that works for you, take a look at the implementations of the other hash functions listed on the wiki. Here is a pretty good reference for MurmurHash3, which can produce 32-bit or 128-bit hash values. The list includes other hash functions with large output widths, as well as available C# libraries. As that link points out, MurmurHash is faster than the MD5 and SHA functions, although it is not directly compared against the Object.GetHashCode() approach mentioned above.
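For reference, here is a straightforward C# port of the public-domain 32-bit x86 variant of MurmurHash3 (the 128-bit variants follow the same mix/rotate pattern):

```csharp
using System;

static class Murmur3
{
    public static uint Hash32(byte[] data, uint seed = 0)
    {
        const uint c1 = 0xcc9e2d51, c2 = 0x1b873593;
        uint h = seed;
        int len = data.Length;
        int blocks = len / 4;

        // Body: process the input four bytes at a time.
        for (int i = 0; i < blocks; i++)
        {
            uint k = BitConverter.ToUInt32(data, i * 4);
            k *= c1; k = Rotl(k, 15); k *= c2;
            h ^= k; h = Rotl(h, 13); h = h * 5 + 0xe6546b64;
        }

        // Tail: the remaining 0-3 bytes.
        uint k1 = 0;
        int tail = blocks * 4;
        switch (len & 3)
        {
            case 3: k1 ^= (uint)data[tail + 2] << 16; goto case 2;
            case 2: k1 ^= (uint)data[tail + 1] << 8; goto case 1;
            case 1:
                k1 ^= data[tail];
                k1 *= c1; k1 = Rotl(k1, 15); k1 *= c2;
                h ^= k1;
                break;
        }

        // Finalization: force the remaining bits to avalanche.
        h ^= (uint)len;
        h ^= h >> 16; h *= 0x85ebca6b;
        h ^= h >> 13; h *= 0xc2b2ae35;
        h ^= h >> 16;
        return h;
    }

    static uint Rotl(uint x, int r) => (x << r) | (x >> (32 - r));
}
```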

+3

How about doing something different?

Use a simple hash function on each record, for example the one you would use when inserting the record into a hash table, mapping each record to a 32-bit int. Then, whenever a hash collision occurs, compare the colliding records directly for uniqueness.
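In C# you get exactly that behavior from HashSet<T>: it calls your comparer's GetHashCode() first and invokes Equals() only when two hashes collide. A sketch, assuming byte-array records:

```csharp
using System;
using System.Collections.Generic;

// HashSet<T> consults GetHashCode first and calls Equals only on a
// hash collision, which is precisely the scheme described above.
class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public int GetHashCode(byte[] data)
    {
        unchecked
        {
            int h = 17;
            foreach (byte b in data) h = h * 31 + b;
            return h;
        }
    }

    public bool Equals(byte[] a, byte[] b) =>
        a.AsSpan().SequenceEqual(b); // full comparison, collisions only
}

// Usage: duplicates are rejected by Add().
// var unique = new HashSet<byte[]>(new ByteArrayComparer());
// bool isNew = unique.Add(record);
```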

+2

You can stick with MD5; when you encounter colliding records, you can re-check them with SHA256 (or even SHA-1).
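A minimal sketch of that two-stage scheme, assuming byte-array records and keying on the MD5 digest; SHA256 only runs when an MD5 bucket already has an occupant:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class TwoStageDedup
{
    static readonly MD5 Md5 = MD5.Create();
    static readonly SHA256 Sha256 = SHA256.Create();

    // Buckets keyed by MD5 digest; the raw records are kept so the
    // SHA256 fallback has something to compare against.
    static readonly Dictionary<string, List<byte[]>> Seen =
        new Dictionary<string, List<byte[]>>();

    public static bool IsNew(byte[] record)
    {
        string key = Convert.ToBase64String(Md5.ComputeHash(record));
        if (!Seen.TryGetValue(key, out var bucket))
        {
            Seen[key] = new List<byte[]> { record };
            return true;
        }
        // MD5 collision (rare): compare SHA256 digests instead of raw bytes.
        byte[] sha = Sha256.ComputeHash(record);
        foreach (var other in bucket)
            if (sha.AsSpan().SequenceEqual(Sha256.ComputeHash(other)))
                return false;
        bucket.Add(record);
        return true;
    }
}
```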

+1

Are you running every entry through SHA256? You only need to check the entries where the MD5 hashes collide, which should be rare even with MD5. And at that point, since you are simply weeding out duplicates, you are probably better off comparing the raw records directly, because the comparison bails out at the first difference.
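That direct comparison is one line in C#; SequenceEqual returns as soon as it sees the first differing byte:

```csharp
using System;

// Direct comparison of the raw records: the length check is free,
// and SequenceEqual stops at the first byte that differs.
static bool SameRecord(byte[] a, byte[] b) =>
    a.Length == b.Length && a.AsSpan().SequenceEqual(b);
```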

+1

You could even do something like this: take the MD5, and if you get a collision, append some extra data (the same data) to both values and take the MD5 again. It is very unlikely that two values would collide again if they were actually different. So instead of falling back to SHA after a collision, do MD5 again with something appended, which should be faster.
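A sketch of that re-hash step, with a hypothetical fixed salt; any bytes work, as long as both colliding values get the same suffix:

```csharp
using System;
using System.Security.Cryptography;

static class SaltedRehash
{
    // On a first-round MD5 collision, append the same fixed suffix to
    // both inputs and hash again; two genuinely different inputs are
    // very unlikely to collide under both the plain and salted hashes.
    static readonly byte[] Salt = { 0xDE, 0xAD, 0xBE, 0xEF };

    public static byte[] RehashWithSalt(MD5 md5, byte[] record)
    {
        byte[] salted = new byte[record.Length + Salt.Length];
        Buffer.BlockCopy(record, 0, salted, 0, record.Length);
        Buffer.BlockCopy(Salt, 0, salted, record.Length, Salt.Length);
        return md5.ComputeHash(salted);
    }
}
```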

0

https://github.com/noricube/xxHashSharp is a C# port of xxHash, one of the fastest hashing algorithms available; note, however, that it is not suitable for cryptographic purposes.
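Usage looks roughly like the following; the CalculateHash and Init/Update/Digest names are taken from that repository's README and should be verified against the current source before you rely on them:

```csharp
using System;
using System.Text;

// Usage sketch for xxHashSharp, assuming the API shown in the repo's README.
class XxHashExample
{
    static void Main()
    {
        byte[] input = Encoding.UTF8.GetBytes("some record");

        // One-shot: 32-bit hash of the whole buffer.
        uint h = xxHash.CalculateHash(input);

        // Streaming: feed the data in chunks.
        var hasher = new xxHash();
        hasher.Init();
        hasher.Update(input, input.Length);
        uint h2 = hasher.Digest();

        Console.WriteLine($"{h:x8} {h2:x8}");
    }
}
```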

0

From the way you phrased the question, it does not sound like you need a security-grade hash algorithm at all. In fact, you may not need a hash algorithm of any kind once you have worked through the basic requirements of what you are trying to accomplish.

If you create a method called unique that returns true if and only if two records differ, you can get speed while maintaining reliability by comparing the following three characteristics of each record, in this order (a sketch follows below):

  • length (if the records are not fixed-length)
  • checksum
  • actual value

The first is essentially free when the record length is variable, since it is already known. The second can be computed cheaply as the records are stored. With a billion records you have to account for the chance of collisions even with security-grade hash algorithms (which, as you said, are too slow), so when the checksums do match, which should be rare if the checksum has enough bits, you still have to fall back to an actual byte-by-byte comparison.
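A minimal sketch of such a unique method, with a hypothetical Record type that carries a checksum computed at storage time:

```csharp
using System;

static class RecordCompare
{
    // 'unique' in the sense above: true iff the two records differ.
    // Checks are ordered cheapest-first; Checksum is assumed to have
    // been computed once, when the record was stored.
    public static bool Unique(Record a, Record b)
    {
        if (a.Data.Length != b.Data.Length) return true; // 1. length
        if (a.Checksum != b.Checksum) return true;       // 2. checksum
        return !a.Data.AsSpan().SequenceEqual(b.Data);   // 3. actual value
    }
}

// Hypothetical record shape for the sketch.
class Record
{
    public byte[] Data;
    public uint Checksum; // computed cheaply during storage
}
```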

0

Source: https://habr.com/ru/post/1488682/

