Can two strings be compared by their "hash digits"?

Question


I have a string that is lost forever. The only thing I have left is some kind of magic hash number. Now I have a new string, which may be similar or equal to the lost one. I need to find out how close it is.

    Integer savedHash = 352736;
    String newText = "this is new string";
    if (Math.abs(hash(newText) - savedHash) < 100) {
        // wow, they are very close!
    }

Are there any algorithms for this purpose?

ps. Text length is not fixed.

pps. I know how regular hash codes work. I'm interested in an algorithm that works differently and gives me the functionality described above.

ppps. In a very simple scenario, this hash() method would look like this:

    public int hash(String txt) {
        return txt.length();
    }

+4

java algorithm design-patterns

yegor256 Mar 29 '11 at 16:07


10 answers

No, that will not work. The similarity of hashes is unrelated to the similarity of the source strings. In fact, two different strings can have the same hash. All you can say for sure is that if the hashes are different, the strings were different.

[Edited in light of the comments: the possibility of a collision is, of course, very real.]

Edit for clarification:

If you only have a hash of the old string, then you cannot recover the original value of that string. There is no algorithm that can tell you whether the hashes of two different strings represent strings that were close, and even if there were, it would not help. Even if you find a string whose hash exactly matches that of your old string, you still don't know whether it was your original string, since any number of strings can produce the same hash value. In fact, a huge number of strings produce the same hash.

[Theoretically, this huge number is actually infinite, but on any real storage system you cannot generate an infinite number of strings. In any case, your chance of matching an unknown string with this approach is very slim unless your hashes are large relative to the input string, and even then you would need to brute-force your way through all possible strings.]
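To make the point concrete, here is a minimal Java sketch that uses the built-in String.hashCode() as a stand-in for the question's unknown hash (that choice, and the helper name gap, are assumptions for illustration only). It shows a collision between different strings, a tiny numeric gap between strings that share no characters, and a large gap between strings that differ by a single letter.

```java
class HashClosenessDemo {
    /** Absolute numeric gap between two hash codes; long arithmetic avoids int overflow. */
    static long gap(String a, String b) {
        return Math.abs((long) a.hashCode() - (long) b.hashCode());
    }

    public static void main(String[] args) {
        // Different strings, identical hash code (a well-known String.hashCode() collision).
        System.out.println(gap("Aa", "BB"));
        // No characters in common, yet the hash codes are adjacent integers.
        System.out.println(gap("Aa", "BC"));
        // One-letter change, yet the gap is far above the question's threshold of 100.
        System.out.println(gap("hello world", "jello world"));
    }
}
```

So a check like Math.abs(hash(newText) - savedHash) < 100 would accept "BC" as close to "Aa" and reject "jello world" as far from "hello world".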

+4

Steve haigh Mar 29 '11 at 16:08


If the hashes do not match, the strings are different.

If the hashes match, the strings are likely to match.

There is nothing else you can do from a hash value.

+4

Ferruccio Mar 29 '11 at 16:10


As others have pointed out, with a typical hashing algorithm, it just doesn't work like that.

There are, however, a few people who have developed algorithms that are at least somewhat similar to this. For example, there is a company called "Xpriori" that has some hashing (or at least hash-like) algorithms that allow things like this. They let you compare the degree of similarity, or (for example) combine hashes so that hash(a) + hash(b) == hash(a+b) (for some definition of +, not just simple addition of numbers). As with most hashes, there is always a chance of a collision, so you have a chance of a false positive (but by choosing the hash size, you can make that chance arbitrarily small).

So, if you are dealing with existing data, you are probably out of luck. If you are creating something new and want capabilities along these lines, it is possible, although trying to do it on your own is seriously nontrivial.
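As a toy illustration of the hash(a) + hash(b) == hash(a+b) property (this is not Xpriori's algorithm, which is not described here), a letter-count histogram behaves this way when + on hashes is element-wise addition and + on strings is concatenation. The class and method names below are hypothetical.

```java
import java.util.Locale;

class HistogramHashDemo {
    /** Toy "hash": how often each letter a-z occurs; all other characters are ignored. */
    static int[] hash(String s) {
        int[] counts = new int[26];
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    /** The "+" on hashes: element-wise addition of two histograms. */
    static int[] add(int[] x, int[] y) {
        int[] sum = new int[26];
        for (int i = 0; i < 26; i++) {
            sum[i] = x[i] + y[i];
        }
        return sum;
    }

    /** Similarity measure: sum of absolute differences (0 means identical histograms). */
    static int distance(int[] x, int[] y) {
        int d = 0;
        for (int i = 0; i < 26; i++) {
            d += Math.abs(x[i] - y[i]);
        }
        return d;
    }
}
```

Because letter order is discarded, anagrams such as "listen" and "silent" get distance 0, which is exactly the kind of false positive mentioned above.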

+1

Jerry Coffin Mar 29 '11 at 17:43


No. Hashes are designed so that minor variations in the input string cause huge differences in the resulting hash. This is really useful for dictionary implementations, as well as for verifying the integrity of a file (a single changed bit will produce a completely different hash). So no, this is not something you could ever use as an inequality comparison.
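The effect described here can be observed with a cryptographic digest. The sketch below assumes SHA-256 via the standard java.security.MessageDigest API (the question does not say which hash is in use) and counts how many of the 256 output bits differ between two inputs; the helper name differingBits is made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class AvalancheDemo {
    /** Number of bits that differ between the SHA-256 digests of two strings. */
    static int differingBits(String a, String b) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // digest(byte[]) finishes the computation and resets md for the next call.
            byte[] da = md.digest(a.getBytes(StandardCharsets.UTF_8));
            byte[] db = md.digest(b.getBytes(StandardCharsets.UTF_8));
            int bits = 0;
            for (int i = 0; i < da.length; i++) {
                bits += Integer.bitCount((da[i] ^ db[i]) & 0xFF);
            }
            return bits;
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Inputs that differ in one character; roughly half of the 256 bits should flip.
        System.out.println(differingBits("aaaa", "aaab"));
    }
}
```

With about half the bits flipping for any change, the numeric distance between two digests carries no information about how similar the inputs were.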

0

Chris eberle Mar 29 '11 at 16:10


If the hash codes are different, they cannot be the same String; however, many Strings can have the same hashCode().

Depending on the nature of the Strings, doing a plain comparison could be more efficient than comparing hashCode(): computing a hash has to inspect and perform a calculation on every character, whereas a comparison can stop early, e.g. if the lengths differ or as soon as it sees a different character.

0

Peter Lawrey Mar 29 '11 at 16:15


Any good hashing algorithm will, by definition, NEVER yield similar hashes for similar arguments. Otherwise, it would be too easy to crack. If the hashed value of "aaaa" looks similar to that of "aaab", then that is a poor hash. I have cracked ones like that before without much difficulty (a fun puzzle to solve!), but you never know, maybe your hash algorithm is poor. Any idea what it is?

If you have time, you can simply brute-force a solution by hashing all possible words. Not elegant, but possible. It is easier if you know the length of the original word.
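A minimal sketch of that brute-force idea, under the assumptions that the hash is Java's String.hashCode() and that the lost word consists only of lowercase letters a-z (neither is stated in the question). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BruteForceDemo {
    /** Returns every a-z string of length 1..maxLen whose hashCode() equals target. */
    static List<String> preimages(int target, int maxLen) {
        List<String> found = new ArrayList<>();
        search(new StringBuilder(), target, maxLen, found);
        return found;
    }

    // Depth-first enumeration of all candidate strings up to maxLen characters.
    private static void search(StringBuilder prefix, int target, int maxLen, List<String> found) {
        if (prefix.length() > 0 && prefix.toString().hashCode() == target) {
            found.add(prefix.toString());
        }
        if (prefix.length() == maxLen) {
            return;
        }
        for (char c = 'a'; c <= 'z'; c++) {
            prefix.append(c);
            search(prefix, target, maxLen, found);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        System.out.println(preimages("cat".hashCode(), 3));
    }
}
```

The search space grows as 26^n, so this is only practical for short strings, and for longer ones it may return several candidates with no way to tell which was the original.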

If it is a standard hash algorithm, such as MD5, you can find websites that already hold large mappings of source strings to hashes and get the answer that way. Try http://hashcrack.com/

I used this site successfully after one of our developers left and I needed to recover a password.

Greetings

Daniel

0

Daniel Williams Mar 29 '11 at 16:15


You can treat the string as a really big number, but that is about the extent of what you can do in the general case. If you have a specific problem domain, you may be able to compress a representation of the string into something smaller without loss, but it still will not be very useful.

For example, if you are working with individual words, you can use soundex to compare how similar two words sound...

The best you can do with traditional hash codes is to compare two strings for probable equality and definite inequality. False positives are possible, but there will be no false negatives. You cannot compare for similarity this way, though.
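For the single-word case, the soundex idea mentioned above can be sketched as follows. This is a compact take on the classic American Soundex rules (first letter plus three digits), written from scratch for illustration; it handles plain a-z input only, and the class name is hypothetical.

```java
import java.util.Locale;

class SoundexDemo {
    // Code for each letter a..z: '0' marks vowels and y, '-' marks h and w.
    private static final String CODES = "0123012-02245501262301-202";

    /** Returns the 4-character Soundex code of a word, or "" if it contains no letters a-z. */
    static String soundex(String word) {
        String s = word.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        out.append(Character.toUpperCase(s.charAt(0)));
        char prev = CODES.charAt(s.charAt(0) - 'a');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char code = CODES.charAt(s.charAt(i) - 'a');
            if (code == '-') {
                continue; // h and w are skipped and do not separate equal codes
            }
            if (code != '0' && code != prev) {
                out.append(code); // a new consonant sound
            }
            prev = code; // a vowel resets prev, so the same code may repeat after it
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

Words that sound alike, such as "Robert" and "Rupert", map to the same code (R163), so equality of codes is a coarse similarity test. It still gives no graded distance, and it only makes sense for individual words.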

0

Jonathan Mar 29 '11 at 16:17


A normal hash code changes a lot when the object changes a little. It is made to distinguish between different objects and does not care how similar they may be. Therefore the answer is no.

0

Evegeny Mar 29 '11 at 17:36


Well, it looks like you need not a real hash of the string, but some fingerprint of the string. Since you want it to consist of 32 bits, one way could be as follows:

Calculate the Pearson correlation coefficient between the first and second halves of the string (if the string has an odd number of characters, add some padding) and store that number as a 32-bit floating-point value. But I'm not sure how reliable this method will be.

== EDIT ==
Here is some sample C code (not optimized) that implements this idea (slightly modified):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <string.h>

    float mean(char *str) {
        char *x;
        float sum = 0.0;
        for (x = str; *x != '\0'; x++) {
            sum += (float) *x;
        }
        return sum / strlen(str);
    }

    float stddev(char *str) {
        char *x;
        float sum = 0.0;
        float u = mean(str);
        for (x = str; *x != '\0'; x++) {
            sum += ((float) *x - u) * ((float) *x - u);
        }
        return sqrt(sum / strlen(str));
    }

    float covariance(char *str1, char *str2) {
        int i;
        int im = fmin(strlen(str1), strlen(str2));
        float sum = 0.0;
        float u1 = mean(str1);
        float u2 = mean(str2);
        for (i = 0; i < im; i++) {
            sum += ((float) str1[i] - u1) * ((float) str2[i] - u2);
        }
        return sum / im;
    }

    float correlation(char *str1, char *str2) {
        float cov = covariance(str1, str2);
        float dev1 = stddev(str1);
        float dev2 = stddev(str2);
        return cov / (dev1 * dev2);
    }

    float string_fingerprint(char *str) {
        int len = strlen(str);
        char *rot = (char *) malloc((len + 1) * sizeof(char));
        int i;
        // rotate string by CHAR_COUNT/2
        for (i = 0; i < len; i++) {
            rot[i] = str[(i + len / 2) % len];
        }
        rot[len] = '\0';
        // now calculate correlation between original and rotated strings
        float corr = correlation(str, rot);
        free(rot);
        return corr;
    }

    int main() {
        char string1[] = "The quick brown fox jumps over the lazy dog";
        char string2[] = "The slow brown fox jumps over the crazy dog";
        float f1 = string_fingerprint(string1);
        float f2 = string_fingerprint(string2);
        if (fabs(f1 - f2) < 0.2) {
            printf("wow, they are very close!\n");
        }
        return 0;
    }

HTH!

0

Agnius vasiliauskas Mar 30 '11 at 8:13


Avi · Accepted Answer · 2011-03-29T16:13:48+0000

Standard hashing will not work in this case, since close hash values do not imply close strings. In fact, most hash functions are designed to give very different values to close strings, so as to create a random distribution of hash values over any given set of input strings.

If you had access to both strings at the same time, you could use some kind of string distance function, such as Levenshtein distance. This calculates the edit distance between two strings, i.e. the number of edits needed to transform one into the other.
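For reference, a minimal sketch of Levenshtein distance using the usual dynamic-programming recurrence with two rolling rows. The class name is hypothetical, and the code compares Java char values, so it does not treat surrogate pairs specially.

```java
class LevenshteinDemo {
    /** Minimum number of single-character insertions, deletions and substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

This needs both strings in full, which is exactly what the question does not have.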

In this case, however, the best approach might be to use some kind of fuzzy hashing. That way you do not have to store the original string and can still get some measure of similarity.
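One technique in this family is SimHash. It is not necessarily the scheme this answer had in mind, and everything below is an illustrative assumption: a 32-bit fingerprint, overlapping 3-character shingles, and a bit-mixing step borrowed from the MurmurHash3 finalizer. Fingerprints are compared by Hamming distance (number of differing bits) rather than by numeric difference.

```java
class SimHashDemo {
    /** 32-bit SimHash over overlapping 3-character shingles; returns 0 for strings shorter than 3. */
    static int simHash(String text) {
        int[] votes = new int[32];
        for (int i = 0; i + 3 <= text.length(); i++) {
            int h = mix(text.substring(i, i + 3).hashCode());
            // Each shingle votes +1 or -1 on every bit position.
            for (int bit = 0; bit < 32; bit++) {
                votes[bit] += ((h >>> bit) & 1) == 1 ? 1 : -1;
            }
        }
        int result = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (votes[bit] > 0) {
                result |= 1 << bit; // the majority of shingles had this bit set
            }
        }
        return result;
    }

    // String.hashCode() of short strings is poorly spread; scramble it (MurmurHash3 finalizer).
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Number of bit positions in which two fingerprints differ (0..32). */
    static int hammingDistance(int a, int b) {
        return Integer.bitCount(a ^ b);
    }
}
```

In the question's terms, savedHash would hold simHash(oldText), and the check becomes hammingDistance(simHash(newText), savedHash) below some threshold. Strings that share most of their shingles tend to produce fingerprints a few bits apart, but that is a statistical tendency rather than a guarantee, and 32 bits leaves plenty of room for false positives.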
