I'm combing through a webapp's log file for lines that stand out.
Most of the lines are similar and uninteresting. I'd pass them through Unix uniq, but that filters nothing, since every line is slightly different: they all have a different timestamp, similar statements may print a different user ID, and so on.
What approach and/or tool can I use to get only the lines that are noticeably different from the rest? (But, again, not exact duplicates.)
I was thinking about playing with Python's difflib, but that seems geared toward diffing two files, not all pairs of lines within one file.
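For concreteness, here's the kind of pairwise scoring I had in mind with difflib (just a sketch; the sample lines are from my examples below, and the O(n²) comparison loop is probably why this won't scale to a real log):

```python
import difflib

def uniqueness_scores(lines):
    """Score each line by how dissimilar it is to its closest neighbour.

    Score = 1 - (best SequenceMatcher ratio against any other line):
    near 0 for lines that closely match something else, near 1 for
    one-off lines.  O(n^2) comparisons, so only a sketch.
    """
    scores = []
    for i, a in enumerate(lines):
        best = max(
            (difflib.SequenceMatcher(None, a, b).ratio()
             for j, b in enumerate(lines) if j != i),
            default=0.0,
        )
        scores.append(1.0 - best)
    return scores

log_lines = [
    "2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234",
    "2009-04-20 00:04:02 INFO com.foo.Bar - URL:/graph?id=asdfghjk",
    "2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses",
]
for score, line in zip(uniqueness_scores(log_lines), log_lines):
    print(f"{score:.3f}  {line}")
```

The two graph lines should score low (they nearly match each other) and the cache line high, which is exactly the thresholding behaviour I describe in the edit below.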
[EDIT]
I suppose a solution would assign a uniqueness score to each line. By "noticeably different" I mean that I pick a threshold, and any line whose uniqueness score exceeds it is included in the output.
If there are other viable ways to define this, please discuss. Also, the method does not need 100% precision and recall.
[/EDIT]
Examples:
I'd prefer answers that are as general as possible. I know I can strip off the timestamp at the beginning. Stripping off the end is harder, since its wording may be completely unlike anything else in the file. Details like these are why I avoided concrete examples before, but since some people asked...
Similar 1:
2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:04:02 INFO com.foo.Bar - URL:/graph?id=asdfghjk
Similar 2:
2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
2009-04-20 00:06:00 INFO com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses
Different 1:
2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
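To illustrate what I mean about stripping variable parts: a normalization pass like the following would collapse each "Similar" pair above into one identical template, after which counting templates flags the odd lines out. (The regexes and the extra ERROR line are made-up illustrations of my format, not something general.)

```python
import re
from collections import Counter

def template(line):
    """Collapse variable fields into placeholders so that structurally
    similar lines compare equal.  The regexes are guesses at my own
    log format: timestamp, id= parameters, filesystem paths, numbers.
    """
    line = re.sub(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", "<TS>", line)
    line = re.sub(r"id=\w+", "id=<ID>", line)
    line = re.sub(r"(?:/\w+)+", "<PATH>", line)
    line = re.sub(r"\d+(?:\.\d+)?", "<NUM>", line)
    return line

log_lines = [
    "2009-04-20 00:03:57 INFO com.foo.Bar - URL:/graph?id=1234",
    "2009-04-20 00:04:02 INFO com.foo.Bar - URL:/graph?id=asdfghjk",
    "2009-04-20 00:05:59 INFO com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses",
    "2009-04-20 00:06:00 INFO com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses",
    # Hypothetical one-off line, added here only to show the output:
    "2009-04-20 00:07:12 ERROR com.foo.Bar - unexpected null widget",
]
counts = Counter(template(l) for l in log_lines)
# Lines whose template is rare are the "noticeably different" ones.
rare = [l for l in log_lines if counts[template(l)] == 1]
for l in rare:
    print(l)
```

The catch, as noted above, is that the regexes encode knowledge of this specific format; with generic masking, the message tail can still vary in ways a fixed pattern won't catch.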