Is String.hashCode() inefficient?

Looking at the source code of java.lang.String in openjdk-1.6, I saw that String.hashCode() uses 31 as a prime and computes

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] 
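To make the formula concrete, here is a minimal sketch that evaluates the same polynomial with Horner's rule and prints it next to the value returned by String.hashCode(). The class and method names (PolyHashSketch, polyHash) are made up for illustration and are not the JDK's own code.

```java
// Sketch: evaluate s[0]*31^(n-1) + ... + s[n-1] with Horner's rule.
// PolyHashSketch and polyHash are illustrative names, not JDK API.
final class PolyHashSketch {

    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // int arithmetic wraps modulo 2^32, which the formula relies on
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a", "Ab", "BC", "hello"}) {
            System.out.println("\"" + s + "\" -> " + polyHash(s)
                    + " (String.hashCode: " + s.hashCode() + ")");
        }
    }
}
```

Run as is, the two columns should agree for every input, and "Ab" and "BC" should print the same value.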

Now, the reason I looked this up was a question I had in mind: would comparing the hash codes in String.equals make String.equals significantly faster? But looking at hashCode now, the following questions come to mind:

  • Wouldn't a larger prime help avoid collisions, at least for short strings? For example, "BC" has the same hash as "Ab", and since ASCII letters live in the range 65-122, wouldn't a higher prime work better?
  • Is it a conscious decision to use 31 as the prime, or is it just some random prime that is used because it is common?
  • How likely is a hash collision given a fixed string length? This leads back to the original question: how well can a comparison of hash codes and string lengths already distinguish between strings, so that comparing the actual contents is avoided?
  • A little off topic, maybe: is there a good reason why String.equals doesn't compare hash codes as an additional shortcut?
  • A little off topic: suppose we have two strings with the same content but different instances. Is there a way to assert equality without actually comparing the contents? I would guess not, since strings of arbitrary length blow up the string space to a size where collisions are inevitable. But what about some restrictions, such as a specific character set or a maximum string length? And how much would we need to restrict the string space to have such a hash function?
1 answer

Wouldn't a larger prime help avoid collisions, at least for short strings, seeing that, for example, "BC" has the same hash as "Ab" (since ASCII letters live in the range 65-122, wouldn't a higher prime work better)?

Each character in a string can take 65536 values (2^16). Therefore, the set of strings of 1 or 2 characters is larger than the number of int values, and any method of computing a hash code will produce collisions among strings of 1 or 2 characters (which, I assume, count as short strings).
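The counting argument can be checked with a few lines of arithmetic. This is only a sketch; ShortStringCount and its members are hypothetical names chosen for the example.

```java
// Sketch of the pigeonhole count: more 1- and 2-char strings than int values.
// ShortStringCount is an illustrative name, not part of any library.
final class ShortStringCount {

    static final long CHAR_VALUES = 1L << 16; // 65536 possible char values
    static final long INT_VALUES = 1L << 32;  // 2^32 possible int hash codes

    // Number of strings of length exactly 1 plus length exactly 2.
    static long stringsOfLengthOneOrTwo() {
        return CHAR_VALUES + CHAR_VALUES * CHAR_VALUES;
    }

    public static void main(String[] args) {
        System.out.println("strings of length 1 or 2: " + stringsOfLengthOneOrTwo());
        System.out.println("distinct int values:      " + INT_VALUES);
        // The concrete collision from the question:
        System.out.println("\"Ab\".hashCode() == \"BC\".hashCode(): "
                + ("Ab".hashCode() == "BC".hashCode()));
    }
}
```

Since there are 2^16 + 2^32 such strings and only 2^32 hash values, at least two of them must share a hash code, whatever the hash function is.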

If you limit your character set, you can find a hash function that reduces the number of collisions (see below).

Note that a good hash should also distribute its output well. A comment buried in this code defends the use of 33 and gives the following reasons (emphasis mine):

If one compares the chi^2 values of the variants [...], the number 33 does not even have the best value. But the number 33 and a few other equally good numbers, such as 17, 31, 63, 127 and 129, nevertheless have a great advantage over the remaining numbers in the large set of possible multipliers: their multiply operation can be replaced by a faster operation based on just one shift plus a single addition or subtraction. And because a hash function has to both distribute well and be very fast to compute, those few numbers should be preferred.

These formulas were designed some time ago. Even if it turned out today that they are not ideal, the implementation could not be changed, because it is documented in the contract of the String class.
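As an illustration of the shift trick described in that comment, the sketch below rewrites multiplication by 31 and by 33 as one shift plus one subtraction or addition. The names ShiftMultiply, times31 and times33 are invented for this example; whether a given JIT or compiler actually performs this rewrite is not something the sketch demonstrates.

```java
// Sketch: 31*h == (h << 5) - h and 33*h == (h << 5) + h in wrapping int arithmetic.
// ShiftMultiply, times31 and times33 are illustrative names.
final class ShiftMultiply {

    static int times31(int h) {
        return (h << 5) - h; // 32*h - h
    }

    static int times33(int h) {
        return (h << 5) + h; // 32*h + h
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : samples) {
            System.out.println(h + ": 31*h matches = " + (times31(h) == 31 * h)
                    + ", 33*h matches = " + (times33(h) == 33 * h));
        }
    }
}
```

The identities hold even when the multiplication overflows, because both sides wrap modulo 2^32 in the same way.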

Is it a conscious decision to use 31 as the prime, or is it just some random prime that is used because it is common?

See the related question: Why does Java's hashCode() in String use 31 as a multiplier?

How likely is a hash collision given a fixed string length?

Assuming that every possible int value is equally likely to be the result of the hashCode function, the chance that two given distinct strings collide is 1 in 2^32.
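Under the same uniformity assumption (which real inputs may not satisfy), the pairwise figure can be extended to a rough estimate for a whole set of strings: with n distinct strings there are n(n-1)/2 pairs, each colliding with probability 1/2^32. The sketch below just does that arithmetic; CollisionEstimate and its methods are hypothetical names.

```java
// Sketch: collision estimates assuming hash codes are uniform over all 2^32 ints.
// CollisionEstimate is an illustrative name; the uniformity is an assumption.
final class CollisionEstimate {

    static final double INT_VALUES = 4294967296.0; // 2^32

    // Probability that two given distinct strings share a hash code.
    static double pairCollisionProbability() {
        return 1.0 / INT_VALUES;
    }

    // Expected number of colliding pairs among n distinct strings.
    static double expectedCollidingPairs(long n) {
        double pairs = (double) n * (n - 1) / 2.0;
        return pairs / INT_VALUES;
    }

    public static void main(String[] args) {
        System.out.println("pair probability: " + pairCollisionProbability());
        for (long n : new long[] {1_000L, 100_000L, 1_000_000L}) {
            System.out.println(n + " strings -> expected colliding pairs: "
                    + expectedCollidingPairs(n));
        }
    }
}
```

So a single comparison almost never hits a collision, while a large collection of strings can be expected to contain some colliding pairs.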

Is there a good reason String.equals doesn't compare hash codes as an extra shortcut?

See the related question: Why doesn't the equals method in String use a hash?

Suppose we have two strings with the same content, but different instances: is there a way to assert equality without actually comparing the contents?

Without any restriction on the strings, no. You can intern the strings and then check for reference equality ( == ), but if many strings are involved, this can be inefficient.
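A minimal sketch of the interning approach, using the standard String.intern() method; the class and helper names (InternSketch, sameInstanceAfterIntern) are made up for the example.

```java
// Sketch: compare interned strings by reference instead of by content.
// InternSketch and sameInstanceAfterIntern are illustrative names.
final class InternSketch {

    static boolean sameInstanceAfterIntern(String a, String b) {
        // intern() returns the canonical instance for the string's content
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("hash");
        String b = new String("hash");
        System.out.println("a == b:                   " + (a == b));
        System.out.println("a.equals(b):              " + a.equals(b));
        System.out.println("a.intern() == b.intern(): " + sameInstanceAfterIntern(a, b));
    }
}
```

Note that intern() itself has to look the string up by content, so this moves the comparison cost to interning time rather than removing it; it pays off mainly when the same strings are compared many times.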

How much do we need to restrict the string space in order to have such a hash function?

If you allow only lowercase letters (26 characters), you can create a hash function that generates unique hashes for all strings from 0 to 6 characters long (inclusive), since sum(i=0..6) (26^i) = 321272407, which is about 3.2 * 10^8 and well below 2^32.
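One possible construction of such a collision-free hash, offered as a sketch rather than the only option, is to read the string as a number in bijective base 26 (a = 1, ..., z = 26). The class and method names are hypothetical.

```java
// Sketch: a collision-free hash for strings of 0..6 lowercase ASCII letters,
// using bijective base-26 numbering. LowercasePerfectHash is an illustrative name.
final class LowercasePerfectHash {

    static final int MAX_LENGTH = 6;

    static int perfectHash(String s) {
        if (s.length() > MAX_LENGTH) {
            throw new IllegalArgumentException("at most " + MAX_LENGTH + " characters");
        }
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z allowed: " + s);
            }
            // digits run from 1 to 26, so "a" and "aa" get different values
            h = h * 26 + (c - 'a' + 1);
        }
        return h;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a", "z", "aa", "zzzzzz"}) {
            System.out.println("\"" + s + "\" -> " + perfectHash(s));
        }
    }
}
```

The empty string maps to 0 and "zzzzzz" to 321272406, so every allowed string gets its own value and the largest one still fits comfortably in an int.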

Source: https://habr.com/ru/post/949634/