Is there a way to compute a general "similarity score" for a string? That is, I do not want to compare two strings directly. Instead, I want to compute a number (a score or hash) for each string on its own, such that these numbers can later tell me whether two strings are similar. Two similar strings should get similar (close) scores.
Consider these strings and scores as an example:
Hello world 1000
Hello World! 1010
Hello Earth 1125
Foo bar 3250
FooBarbar 3750
Foo Bar! 3300
World Foo! 2350
You can see that "Hello world" and "Hello World!" are similar, and their scores are close to each other.
Thus, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the differences.
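To make the intended lookup concrete, here is a minimal sketch of that search step. It assumes the scores already exist and simply reuses the made-up numbers from the example above; how to compute such scores is exactly what I am asking. The name `most_similar` and the parameter `k` are only illustrative.

```python
# Hypothetical scores copied from the example above.
# They are NOT produced by any real algorithm; computing them is the open question.
scores = {
    "Hello world": 1000,
    "Hello World!": 1010,
    "Hello Earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "World Foo!": 2350,
}


def most_similar(query, scores, k=3):
    """Return the k strings whose scores are closest to the score of `query`."""
    target = scores[query]
    # Pair each other string with the absolute difference of its score.
    candidates = [
        (abs(score - target), line)
        for line, score in scores.items()
        if line != query
    ]
    # Smallest difference first, i.e. most similar first.
    candidates.sort()
    return [line for _, line in candidates[:k]]


print(most_similar("Hello world", scores, k=2))
print(most_similar("Foo bar", scores, k=2))
```

With the example numbers, the nearest neighbours of "Hello world" are "Hello World!" and "Hello Earth", which is the behaviour I am after.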
My ultimate goal: I have streaming log messages (the message text only), and I want to find a template for these messages (some kind of regular expression). But that can only start once I can group similar strings together. To stress it again: I need a number (a score or hash) for each string that can later tell me whether two strings are similar.
Ajay