PROBLEM: I often need to find the most frequently repeated "patterns" over the last day of certain logs. For example, for a small subset of tomcat logs:
GET /app1/public/pkg_e/v3/555413242345562/account/stats 401 954 5
GET /app1/public/pkg_e/v3/555412562561928/account/stats 200 954 97
GET /app1/secure/pkg_e/v3/555416251626403/ex/items/ 200 517 18
GET /app1/secure/pkg_e/v3/555412564516032/ex/cycle/items 200 32839 50
DELETE /app1/internal/pkg_e/v3/accounts/555411543532089/devices/bbbbbbbb-cccc-2000-dddd-43a8eabcdaa0 404 - 1
GET /app1/secure/pkg_e/v3/555412465246556/sessions 200 947 40
GET /app1/public/pkg_e/v3/555416264256223/account/stats 401 954 4
GET /app2/provisioning/v3/555412562561928/devices 200 1643 65
...
If I want to find the most commonly used URLs (along with the method and return code), I do:
[root@srv112:~]$ N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|sort|uniq -c|sort -rn|head -$N
      4 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (401)
      2 GET /app1/secure/pkg_e/v3/%USERNAME%/devices (200)
      2 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (200)
      2 DELETE /app1/internal/pkg_e/v3/accounts/%USERNAME%/devices/%GUID% (404)
      1 POST /app2/servlet/handler (200)
      1 POST /app1/servlet/handler (200)
If I want to find the most common usernames in the same file, I do:
[root@srv112:~]$ N=4;cat test|grep -Po '(?<=\/)[0-9]{4,}(?=\/)'\
|sort|uniq -c|sort -rn|head -$N
      9 555412562561928
      2 555411543532089
      1 555417257243373
      1 555416264256223
The above works quite well on small data sets, but for larger inputs the performance (complexity) of sort|uniq -c|sort -rn|head -$N is unbearable (we are talking about ~100 servers, ~250 log files per server, ~1 million lines per log file).
ATTEMPT TO SOLVE: The |sort|uniq -c part is easily replaced with an awk 1-liner, turning it into:
|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'|sort -rn|head -$N
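For example, the URL-pattern pipeline from above then becomes something like the following (same normalization steps as before, only the counting stage is swapped out - untested, just to show where it plugs in):

N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'\
|sort -rn|head -$N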
but I could not find a standard / simple and memory-efficient implementation of the "quickselect algorithm" (discussed here) to optimize the |sort -rn|head -$N part. I was looking for existing GNU utilities, rpms, awk 1-liners, or some easily-compilable ANSI C code that I could carry/distribute across data centers, to turn:
3 tasty oranges
225 magic balls
17 happy dolls
15 misty clouds
93 juicy melons
55 rusty ideas
...
into (with N=3):
225 magic balls
93 juicy melons
55 rusty ideas
I could probably grab some Java sample code and adapt it to the stdin format above (by the way, I was surprised by the lack of a .quickselect(...) in the core JDK), but having to deploy a Java runtime everywhere is not appealing. I could also take an array-based C snippet, adapt it to the stdin format above, then test it and fix the leaks and so on for a while. Or I could even implement it from scratch in awk. BUT (!) - this simple need probably hits more than 1% of people on a regular basis - shouldn't there be a standard (pre-tested) implementation out there? I hope so... maybe I'm just using the wrong keywords to search for it...
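For reference, here is a rough (untested) sketch of what such an awk implementation might look like - it streams "count pattern" lines and keeps only an N-slot buffer sorted by count, so it never sorts (or even holds) the whole input; not quickselect proper, but the same end result for small N:

# streaming top-N: input lines are "<count> <pattern>", N passed via -v
awk -v N=3 '
{
    c = $1 + 0
    if (n < N || c > cnt[n]) {                # only bother if it can make the top N
        pos = (n < N) ? ++n : n               # grow the buffer until it holds N entries
        while (pos > 1 && cnt[pos-1] < c) {   # shift smaller entries down one slot
            cnt[pos] = cnt[pos-1]; val[pos] = val[pos-1]; pos--
        }
        cnt[pos] = c; val[pos] = $0
    }
}
END { for (i = 1; i <= n; i++) print val[i] }'

Fed the "tasty oranges" sample above it should print exactly the three desired lines; each input line costs at most N shifts, so it is O(lines*N) in the worst case (effectively linear for small N) and it only ever keeps N lines in memory.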
OTHER OBSTACLES: I also ran into several problems when working with large data sets:
- the log files sit on NFS-mounted volumes of ~100 servers, so it made sense to parallelize and split the work into smaller chunks.
- the awk '{S[$0]+=1}...' above needs a lot of memory - I see it dying once it has eaten up to 16 GB (despite 48 GB of free RAM and plenty of swap... maybe some linux limit I forgot about) - a possible sharding workaround is sketched right after this list.
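For what it's worth, one (untested) idea to work around the awk memory blow-up would be to shard the normalized lines into K bucket files via a cheap hash, aggregate each bucket with its own awk invocation, and only then merge - each table then only holds roughly 1/K of the distinct keys (K=16 and the bucket.* file names are arbitrary):

# shard stdin into K bucket files by a simple per-line hash
K=16
awk -v K=$K '
BEGIN { for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }  # char -> code lookup
{
    h = 0
    for (j = 1; j <= length($0); j++)
        h = (h * 31 + ord[substr($0, j, 1)]) % K
    print > ("bucket." h)       # a given line always lands in the same bucket
}'
# aggregate each (much smaller) bucket independently, then merge
for b in bucket.*; do
    awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}' "$b"
done|sort -rn|head -$N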
My current solution is still unreliable and suboptimal (work in progress) and looks as follows:
find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# TODO: reorder 'find' output to round-robin through srv1 srv2 ...
#       to help 'parallel' work with multiple servers at once
parallel -P20 $"zgrep -Po '[my pattern-grep regexp]' {}\
|awk '{S[\$0]+=1} END{for(i in S)if(S[i]>4)print \"count: \"S[i]\"\\n\"i}'"|\
# I throw away patterns met less than 5 times per log file
# in the hope those won't pop up on top of the result list anyway - bogus,
# but it helps to address the 16GB-mem problem for the 'awk' below
awk '{if("count:"==$1){C=$2}else{S[$0]+=C}} END{for(i in S)if(S[i]>99)print S[i]"\t"i}'|\
# I also skip all patterns met less than 100 times
# - the hope that these won't end up on top of the list is quite reliable
sort -rn|head -$N
# the line above is the inefficient one I am trying to address
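And regarding the TODO above - a possible (untested) way to round-robin the 'find' output across servers, assuming the srvNNN directory is the 4th '/'-separated component of each path:

find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# prefix each path with its per-server sequence number ($4 = srvNNN)
awk -F/ '{print ++seen[$4]"\t"$0}'|\
# numeric sort on that prefix interleaves the servers (1st file of each
# server, then 2nd of each, ...); cut drops the helper column again
sort -n|cut -f2-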