Determining the similarity of items in a database

We have a database with hundreds of millions of log entries. We are trying to "group" log entries that are of the same nature as other entries in the database. For instance:

Entry X may contain a log line such as:

Change transaction ABC123 assigned to server US91

And entry Y may contain a log line such as:

Change transaction XYZ789 assigned to server GB47

To a human, these two log entries are easily recognizable as probably related in some way. Now there can be 10 million rows between record X and record Y, and there may be thousands of other entries: some are similar to X and Y, some are completely different, and some are similar to one another.

What I'm trying to determine is the best way to group similar entries together and say, with XX% confidence, that record X and record Y are probably of the same nature. Or perhaps to have the system look at record Y and report, based on its content, which existing record (record X, say) it is most similar to, out of all the other records.

I've seen mention of natural language processing and other ways to look for similarity between strings (for example, brute-force Levenshtein distance calculations; a rough sketch of that baseline follows the list below) - however, we have two additional complications:

  • The content is machine-generated, not written by humans.
  • Unlike the search-engine approach, where we determine results for a given query, we are trying to classify a giant repository and group entries by how similar they are to each other.
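
For concreteness, this is the kind of brute-force pairwise baseline I mean - a rough sketch using Python's standard difflib on the two example entries above (illustrative only, not what we actually run):

    # The brute-force string-similarity baseline mentioned above.
    # difflib ships with Python; a Levenshtein ratio would behave similarly.
    from difflib import SequenceMatcher

    entry_x = "Change transaction ABC123 assigned to server US91"
    entry_y = "Change transaction XYZ789 assigned to server GB47"

    # Ratio in [0, 1]; these two lines score highly because only the IDs differ.
    print(SequenceMatcher(None, entry_x, entry_y).ratio())

    # The catch: comparing every entry against every other entry is O(n^2),
    # which is hopeless for hundreds of millions of rows.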

Thanks for your input!

+4
5 answers

An interesting problem. Obviously, this is a scale problem: you really don't want to compare each record with every other record in the database. I think I would keep a list of "known types" and score entries against the types in that list to see whether each new entry matches one of them.

The "scoring" part, we hope, will find some good answers here - your ability to score against known types is the key to making this work well, and I have the feeling that you are in a better position than we should get it like that. Maybe some kind of consonant sound? Or, if you can figure out how to “discover” which parts of the new records are changing, you can define your well-known types as regular expression expressions.

At that point, for each entry you will hopefully have either a match (with high confidence), a match (with lower confidence), or, most likely, no match at all. In the latter case, you have most likely found a new "type" that should be added to the list of known types. If you keep track of the score for each match, you can also go back to the low-scoring matches later and see whether a better match turns up as your processing continues.
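
A rough sketch of the regular-expression variant of that idea, assuming the "known types" are kept as patterns with the changing parts wildcarded (the names and structure here are made up for illustration):

    import re

    # Hypothetical list of "known types", each defined by a regular expression
    # discovered from earlier entries (the variable parts replaced by wildcards).
    known_types = {
        "change-assignment": re.compile(r"Change transaction \w+ assigned to server \w+"),
        # ... more types added as they are discovered ...
    }

    def classify(entry):
        """Return the name of the first known type that matches, else None."""
        for name, pattern in known_types.items():
            if pattern.fullmatch(entry):
                return name
        return None

    entry = "Change transaction XYZ789 assigned to server GB47"
    matched = classify(entry)
    if matched is None:
        # No match at all: most likely a new "type" - remember it for later entries.
        known_types["type-%d" % len(known_types)] = re.compile(re.escape(entry))
    print(matched)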

+1

I would suggest indexing your data with a text search engine such as Lucene, splitting the log entries into terms. Since your data is machine-generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a pair of consecutive words; in your example you would have the following bigrams:

  Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47

For each log entry, prepare a query in the same way and the search engine will give you the most similar results. You may need to tweak the similarity function a bit to get better results, but I think this is a good start.
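
Not Lucene itself, but a small sketch of the n-gram idea: build word bigrams for each entry and compare them with a simple set-overlap score. Lucene's scoring is more sophisticated; this only shows why shared bigrams make the two example entries look similar:

    def word_ngrams(text, n=2):
        """Return the set of word n-grams (joined with '_') for a log entry."""
        words = text.split()
        return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    x = "Change transaction ABC123 assigned to server US91"
    y = "Change transaction XYZ789 assigned to server GB47"

    bigrams_x = word_ngrams(x)  # e.g. {'Change_transaction', 'assigned_to', ...}
    bigrams_y = word_ngrams(y)

    # Jaccard overlap as a crude stand-in for the search engine's relevance score.
    print(len(bigrams_x & bigrams_y) / len(bigrams_x | bigrams_y))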

+1

Two main strategies come to mind:

  • Ad-hoc. Use an information-retrieval approach: build an index over the log entries, possibly with a specialized tokenizer/parser, by feeding them into a regular full-text search engine. I have heard of people doing this with Xapian and Lucene. You can then "search" for a new log entry, and the search engine will (hopefully) return some related log entries to compare it with. Note, however, that the information-retrieval approach is usually only interested in finding the 10 or so most similar results.

  • Clustering approach. You usually need to turn the data into numerical vectors (which may well be sparse), e.g. TF-IDF weights. Then you can apply a clustering algorithm to find groups of closely related lines (like your example above) and study their nature. You may need to tweak this a bit so that it does not, for example, cluster on the server ID.

Both strategies have their ups and downs. The first is pretty fast, but it will only ever hand you back some similar existing log lines, without telling you much about how common that kind of line is. It is mostly useful for human inspection.

The second strategy is more computationally intensive, and depending on your parameters it may fail entirely (so test it on a subset first), but it can also give more useful results by actually building large groups of log entries that are very closely related.
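
A minimal sketch of the clustering strategy, assuming scikit-learn is available (for hundreds of millions of entries you would want to test on a subset first, as noted above, and use a streaming or mini-batch setup):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import MiniBatchKMeans

    log_entries = [
        "Change transaction ABC123 assigned to server US91",
        "Change transaction XYZ789 assigned to server GB47",
        "Disk quota exceeded on volume /data01",
        # ... many millions more in reality ...
    ]

    # Turn each entry into a sparse TF-IDF vector. A custom tokenizer could be
    # used here to drop the IDs / server names you do not want to cluster on.
    vectors = TfidfVectorizer().fit_transform(log_entries)

    # The number of clusters is a parameter you would have to tune.
    labels = MiniBatchKMeans(n_clusters=2, random_state=0).fit_predict(vectors)
    print(labels)  # entries with the same label ended up in the same group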

+1

It sounds like you could take the Lucene approach mentioned above and then use it as a source of input vectors for the Mahout machine learning library (http://mahout.apache.org/). From there you can train a classifier, or just use one of its clustering algorithms.

0

If your DBMS has this, take a look at SOUNDEX ().

-2

Source: https://habr.com/ru/post/1386602/

