ALGORITHM - string similarity / hash score

Is there a way to compute something like a general "similarity score" for a string? The idea is that I would not compare two lines directly; instead, I would get some number / rating (hash) for each line, which could later tell me whether two lines are or are not similar. Two similar lines should have similar (close) ratings / hashes.

Consider these lines and their ratings as an example:

    Hello world     1000
    Hello World!    1010
    Hello Earth     1125
    Foo bar         3250
    FooBarbar       3750
    Foo Bar!        3300
    World Foo!      2350

You can see that "Hello world" and "Hello World!" are similar, and their ratings are close to each other.

The search for the lines most similar to a given line could then be performed by subtracting the given line's rating from the ratings of the other lines and sorting by the absolute value of the difference.
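With the placeholder ratings above, the lookup I have in mind would look roughly like this (a sketch only; how to compute such a rating in the first place is exactly the open question):

    # Hypothetical ratings; finding the lines most similar to a query then
    # reduces to sorting by the absolute difference of the ratings.
    scores = {
        "Hello world": 1000,
        "Hello World!": 1010,
        "Hello Earth": 1125,
        "Foo bar": 3250,
        "FooBarbar": 3750,
        "Foo Bar!": 3300,
        "World Foo!": 2350,
    }

    def most_similar(line, scores):
        ref = scores[line]
        return sorted(scores, key=lambda other: abs(scores[other] - ref))

    print(most_similar("Hello world", scores))
    # -> ['Hello world', 'Hello World!', 'Hello Earth', 'World Foo!', ...]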

My ultimate goal: I have streaming log messages (just the message text), and I want to derive templates for these messages (some kind of regular expression). But that can only start once I can group similar strings. To stress the point again: I need some number / rating (hash) for each line that can later tell me whether two lines are or are not similar.

+6
source share
8 answers

There are several such "ratings", but they all depend on how you define similarity.

+5
source

See locality-sensitive hashing.

The main idea is to hash the input items so that similar items map to the same buckets with high probability (the number of buckets being much smaller than the universe of possible inputs).

There is a very good explanation here along with some sample code.
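As a further illustration (a minimal sketch, not the sample code linked above), a SimHash-style fingerprint assigns each line a single 64-bit number such that similar lines tend to receive fingerprints differing in only a few bits; note that closeness here means a small Hamming distance between the bit patterns, not a small numeric difference:

    # SimHash sketch: hash character 3-grams of a line into 64-bit values and
    # combine them bit by bit, so similar lines get fingerprints with a small
    # Hamming distance between them.
    import hashlib

    def simhash(text, n=3, bits=64):
        counts = [0] * bits
        grams = [text[i:i + n] for i in range(max(1, len(text) - n + 1))]
        for g in grams:
            h = int.from_bytes(hashlib.md5(g.encode()).digest()[:8], "big")
            for b in range(bits):
                counts[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(bits) if counts[b] > 0)

    def bit_distance(a, b):
        # Number of differing bits between two fingerprints.
        return bin(a ^ b).count("1")

    lines = ["Hello world", "Hello World!", "Hello Earth", "Foo bar", "FooBarbar"]
    fp = {s: simhash(s) for s in lines}
    for s in lines[1:]:
        print(s, bit_distance(fp["Hello world"], fp[s]))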

+5
source

TL;DR: Python BK-tree

Interesting question. I have limited experience in this area, but since the Levenshtein distance satisfies the triangle inequality, I figured there should be a way to compute some absolute distance to an origin in order to find strings close to each other, without performing direct comparisons against every record in the entire database.

While searching for terms related to this, I found a particularly interesting thesis: Aspects of Metric Spaces in Computation by Matthew Adam Skala.

On page 26 he discusses similarity measures based on kd-trees and other structures, but concludes:

However, general metric spaces do not provide the geometry required by those techniques. For a general metric space without further assumptions, it is necessary to use a distance-based approach, which indexes points based only on their distances from each other. Burkhard and Keller [35] proposed one of the first such index structures, now known as the BK-tree after their initials, in 1973. A BK-tree assumes a metric with a few discrete return values; each internal node contains a vantage point, and the subtrees correspond to the different values of the metric.

A blog post on how BK-trees work can be found here.

Later in the thesis, Skala goes on to describe other solutions to this problem, including VP-trees and GH-trees. Chapter 6 analyzes distances based on the Levenshtein edit distance, and it also presents some other interesting distance metrics for strings.

I also found Foundations of Multidimensional and Metric Data Structures, which seems relevant to your question.

+4
source

You can always use the Levenshtein distance; there is also a ready-made implementation: http://code.google.com/p/pylevenshtein/
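If you prefer to avoid the extension module, a minimal pure-Python sketch of the same distance (the standard dynamic-programming formulation, much slower than the C version above) looks like this:

    # Levenshtein (edit) distance via dynamic programming, O(len(a) * len(b)).
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    print(levenshtein("Hello world", "Hello World!"))  # -> 2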

But for simplicity, you can use the built-in difflib module:

    >>> import difflib
    >>> l = {'Hello Earth', 'Hello World!', 'Foo Bar!', 'Foo world!', 'Foo bar', 'Hello World', 'FooBarbar'}
    >>> difflib.get_close_matches("Foo World", l)
    ['Foo world!', 'Hello World', 'Hello World!']

http://docs.python.org/library/difflib.html#difflib.get_close_matches

+1
source

You can use fuzzy hashing to quickly estimate the similarity of strings.
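As a sketch of what that could look like, assuming the python-ssdeep bindings are available (the answer does not name a library, and note that context-triggered piecewise hashes such as ssdeep are designed for inputs longer than a single short log line):

    # Sketch assuming the python-ssdeep package is installed (pip install ssdeep).
    import ssdeep

    # ssdeep works best on longer inputs, so these toy strings are padded.
    a = ssdeep.hash("Hello world, this is a fairly long log message about foo " * 8)
    b = ssdeep.hash("Hello World, this is a fairly long log message about foo! " * 8)

    # compare() returns a 0-100 match score; higher means more similar.
    print(ssdeep.compare(a, b))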

+1
source

You may be interested in the Hamming distance. The Python function hamming_distance() below computes the Hamming distance between two strings:

    def hamming_distance(s1, s2):
        # The Hamming distance is only defined for strings of equal length.
        assert len(s1) == len(s2)
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
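For example (a hypothetical call; note that both strings must have the same length):

    >>> hamming_distance("Hello world", "Hello World")
    1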
0
source

You might want to use a BK-tree. Here is a discussion and a Python implementation.

A BK-tree stores strings in a tree organized by Levenshtein distance to the parent nodes. It is usually used to prune the search space when looking for similar strings, but it seems this tree would also impose a natural ordering that could be used to form clusters.
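A minimal sketch of the idea (not the implementation linked above), with a small pure-Python Levenshtein function standing in for the metric:

    # Minimal BK-tree sketch: each child edge is labelled with the Levenshtein
    # distance between the child's string and its parent's string.

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    class BKTree:
        def __init__(self, distance=levenshtein):
            self.distance = distance
            self.root = None  # (word, {edge_distance: child_node})

        def add(self, word):
            if self.root is None:
                self.root = (word, {})
                return
            node = self.root
            while True:
                d = self.distance(word, node[0])
                child = node[1].get(d)
                if child is None:
                    node[1][d] = (word, {})
                    return
                node = child

        def search(self, word, max_dist):
            # Return all stored words within max_dist of `word`, pruning
            # subtrees via the triangle inequality.
            results = []
            stack = [self.root] if self.root is not None else []
            while stack:
                node = stack.pop()
                d = self.distance(word, node[0])
                if d <= max_dist:
                    results.append((d, node[0]))
                for edge, child in node[1].items():
                    if d - max_dist <= edge <= d + max_dist:
                        stack.append(child)
            return sorted(results)

    tree = BKTree()
    for s in ["Hello world", "Hello World!", "Hello Earth", "Foo bar",
              "FooBarbar", "Foo Bar!", "World Foo!"]:
        tree.add(s)

    print(tree.search("Hello world!", max_dist=3))
    # -> [(1, 'Hello World!'), (1, 'Hello world')]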

0
source

I don't know if you are still working on this, but in information theory there is a way to measure how much information a string or piece of text contains; maybe you could use that value as the hash for sorting your lines. It is called entropy, and Wikipedia has a nice article about it: https://en.wikipedia.org/wiki/Entropy_(information_theory)
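A minimal sketch of that idea, using the Shannon entropy of a line's character distribution (note that entropy is a coarse signal, since very different lines can have similar entropy):

    # Shannon entropy of a line's character distribution, in bits per character.
    import math
    from collections import Counter

    def entropy(text):
        counts = Counter(text)
        total = len(text)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(entropy("Hello world"), entropy("Hello World!"), entropy("FooBarbar"))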

0
source

Source: https://habr.com/ru/post/892578/

