We have a database with hundreds of millions of log entries. We are trying to "group" entries in the log data that are of the same nature as other entries. For instance:
Entry X may contain a journal entry, for example:
Change transaction ABC123 assigned to server US91
And entry Y may contain a journal entry, for example:
Change transaction XYZ789 assigned to server GB47
To a human, these two journal entries are easily recognizable as probably related in some way. Note that there can be 10 million lines between record X and record Y, and there may be thousands of other entries similar to X and Y; some entries are completely different, while others are similar.
What I'm trying to determine is the best way to group similar entries together and say, with XX% confidence, that record X and record Y are probably of the same nature. Or perhaps the system should look at record Y and report, based on its content, which other record (such as X) it most resembles, ranked against all the other records.
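As a starting point for the "XX% confidence" idea, a character-level similarity ratio between two entries can serve as a rough score. This is only a sketch using Python's standard-library `difflib` (not a scalable solution for hundreds of millions of rows); the two log lines are the examples from the question:

```python
from difflib import SequenceMatcher

x = "Change transaction ABC123 assigned to server US91"
y = "Change transaction XYZ789 assigned to server GB47"

# Ratio of matching characters, in [0, 1]; the shared skeleton
# "Change transaction ... assigned to server ..." dominates, so
# the score is high despite the differing IDs.
confidence = SequenceMatcher(None, x, y).ratio()
print(f"{confidence:.0%}")
```

Pairwise scores like this are quadratic in the number of entries, which is why the question's scale makes brute-force comparison impractical on its own.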
I saw some mention of natural language processing and other ways of measuring similarity between strings (for example, brute-force Levenshtein distance calculations), but we have two additional problems:
- The content is machine-generated, not human-written.
- Unlike a search-engine approach, where we find results for a given query, we are trying to classify a giant repository and group its entries by similarity to each other.
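Because the content is machine-generated, one common trick (often called log templating) is to mask the variable fields so that lines of the same nature collapse to one key, which groups the whole repository in a single linear pass instead of pairwise comparisons. A minimal sketch, assuming variable fields are the tokens that contain digits (a heuristic, not a general rule):

```python
import re
from collections import defaultdict

def signature(line: str) -> str:
    # Replace digit-bearing tokens (transaction IDs, server names)
    # with a placeholder, leaving the fixed message skeleton.
    return " ".join("<VAR>" if re.search(r"\d", t) else t
                    for t in line.split())

logs = [
    "Change transaction ABC123 assigned to server US91",
    "User login failed for account 4415",
    "Change transaction XYZ789 assigned to server GB47",
    "User login failed for account 9001",
]

# One pass: every entry lands in the bucket of its template.
groups = defaultdict(list)
for line in logs:
    groups[signature(line)].append(line)

for sig, members in groups.items():
    print(len(members), sig)
```

Here both "Change transaction" lines fall into one bucket and both "User login failed" lines into another. The exact masking rule would need tuning to your real log formats, but the single-pass bucketing is what makes this approach viable at hundreds of millions of rows.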
Thanks for your input!