I'm combing through a webapp's log file for lines that stand out.
Most of the lines are similar and uninteresting. I'd pass them through Unix uniq, but that filters nothing, since every line is slightly different: they all have a different timestamp, similar statements may print a different user ID, and so on.
What approach and/or tool can I use to get only the lines that are noticeably different from the rest? (But, again, not exact duplicates.)
I was thinking about playing with Python's difflib, but that seems geared toward diffing two files, not all pairs of lines within one file.
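For concreteness, here's the kind of pairwise scoring I had in mind with difflib (just a sketch; the sample lines are from my examples below, and the O(n²) comparison loop is probably why this won't scale to a real log):

```python
import difflib

def uniqueness_scores(lines):
    """Score each line by how dissimilar it is to its closest neighbour.

    Score = 1 - (best SequenceMatcher ratio against any other line):
    near 0 for lines that closely match something else, near 1 for
    one-off lines.  O(n^2) comparisons, so only a sketch.
    """
    scores = []
    for i, a in enumerate(lines):
        best = max(
            (difflib.SequenceMatcher(None, a, b).ratio()
             for j, b in enumerate(lines) if j != i),
            default=0.0,
        )
        scores.append(1.0 - best)
    return scores

log_lines = [
    "2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234",
    "2009-04-20 00:04:02 INFO com.foo.Bar - URL:/graph?id=asdfghjk",
    "2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses",
]
for score, line in zip(uniqueness_scores(log_lines), log_lines):
    print(f"{score:.3f}  {line}")
```

The two graph lines should score low (they nearly match each other) and the cache line high, which is exactly the thresholding behaviour I describe in the edit below.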
[EDIT]
I suppose a solution would assign a uniqueness score to each line. By "noticeably different" I mean that I pick a threshold, and any line whose uniqueness score exceeds it is included in the output.
If there are other viable ways to define this, please discuss. Also, the method does not need 100% precision and recall.
[/EDIT]
Examples:
I'd prefer answers that are as general as possible. I know I can strip off the timestamp at the beginning. Stripping off the end is harder, since its wording may be completely unlike anything else in the file. Details like these are why I avoided concrete examples before, but since some people asked...
Similar 1:
2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:04:02 INFO com.foo.Bar - URL:/graph?id=asdfghjk
Similar 2:
2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
2009-04-20 00:06:00 INFO com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses
Different 1:
2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
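To illustrate what I mean about stripping variable parts: a normalization pass like the following would collapse each "Similar" pair above into one identical template, after which counting templates flags the odd lines out. (The regexes and the extra ERROR line are made-up illustrations of my format, not something general.)

```python
import re
from collections import Counter

def template(line):
    """Collapse variable fields into placeholders so that structurally
    similar lines compare equal.  The regexes are guesses at my own
    log format: timestamp, id= parameters, filesystem paths, numbers.
    """
    line = re.sub(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", "<TS>", line)
    line = re.sub(r"id=\w+", "id=<ID>", line)
    line = re.sub(r"(?:/\w+)+", "<PATH>", line)
    line = re.sub(r"\d+(?:\.\d+)?", "<NUM>", line)
    return line

log_lines = [
    "2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234",
    "2009-04-20 00:04:02 INFO com.foo.Bar - URL:/graph?id=asdfghjk",
    "2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses",
    "2009-04-20 00:06:00 INFO com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses",
    # Hypothetical one-off line, added here only to show the output:
    "2009-04-20 00:07:12 ERROR com.foo.Bar - unexpected null widget",
]
counts = Counter(template(l) for l in log_lines)
# Lines whose template is rare are the "noticeably different" ones.
rare = [l for l in log_lines if counts[template(l)] == 1]
for l in rare:
    print(l)
```

The catch, as noted above, is that the regexes encode knowledge of this specific format; with generic masking, the message tail can still vary in ways a fixed pattern won't catch.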