PROBLEM: I often need to find the most frequently repeated "patterns" over the last day of certain logs. For example, for a small subset of tomcat logs:
GET /app1/public/pkg_e/v3/555413242345562/account/stats 401 954 5
GET /app1/public/pkg_e/v3/555412562561928/account/stats 200 954 97
GET /app1/secure/pkg_e/v3/555416251626403/ex/items/ 200 517 18
GET /app1/secure/pkg_e/v3/555412564516032/ex/cycle/items 200 32839 50
DELETE /app1/internal/pkg_e/v3/accounts/555411543532089/devices/bbbbbbbb-cccc-2000-dddd-43a8eabcdaa0 404 - 1
GET /app1/secure/pkg_e/v3/555412465246556/sessions 200 947 40
GET /app1/public/pkg_e/v3/555416264256223/account/stats 401 954 4
GET /app2/provisioning/v3/555412562561928/devices 200 1643 65
...
If I want to find the most commonly used URLs (along with the method and return code), I do:
[root@srv112:~]$ N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|sort|uniq -c|sort -rn|head -$N
      4 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (401)
      2 GET /app1/secure/pkg_e/v3/%USERNAME%/devices (200)
      2 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (200)
      2 DELETE /app1/internal/pkg_e/v3/accounts/%USERNAME%/devices/%GUID% (404)
      1 POST /app2/servlet/handler (200)
      1 POST /app1/servlet/handler (200)
If I want to find the most common usernames in the same file, I do:
[root@srv112:~]$ N=4;cat test|grep -Po '(?<=\/)[0-9]{4,}(?=\/)'\
|sort|uniq -c|sort -rn|head -$N
      9 555412562561928
      2 555411543532089
      1 555417257243373
      1 555416264256223
The above works quite well on small data sets, but for larger inputs the performance (complexity) of sort|uniq -c|sort -rn|head -$N is unbearable (we are talking about ~100 servers, ~250 log files per server, ~1 million lines per log file).
ATTEMPT TO SOLVE: The |sort|uniq -c part is easily replaced with an awk 1-liner, turning it into:
|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'|sort -rn|head -$N
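For example, the URL-pattern pipeline from above then becomes something like the following (same normalization steps as before, only the counting stage is swapped out - untested, just to show where it plugs in):

N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'\
|sort -rn|head -$N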
but I could not find a standard / simple and memory-efficient implementation of the "quickselect algorithm" (discussed here) to optimize the |sort -rn|head -$N part. I was looking for existing GNU utilities, rpms, awk 1-liners, or some easily-compilable ANSI C code that I could carry/distribute across data centers, to turn:
3 tasty oranges
225 magic balls
17 happy dolls
15 misty clouds
93 juicy melons
55 rusty ideas
...
into (with N=3):
225 magic balls
93 juicy melons
55 rusty ideas
I could probably grab some Java sample code and adapt it to the stdin format above (by the way, I was surprised by the lack of a .quickselect(...) in the core JDK), but having to deploy a Java runtime everywhere is not appealing. I could also take an array-based C snippet, adapt it to the stdin format above, then test it and fix the leaks and so on for a while. Or I could even implement it from scratch in awk. BUT (!) - this simple need probably hits more than 1% of people on a regular basis - shouldn't there be a standard (pre-tested) implementation out there? I hope so... maybe I'm just using the wrong keywords to search for it...
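For reference, here is a rough (untested) sketch of what such an awk implementation might look like - it streams "count pattern" lines and keeps only an N-slot buffer sorted by count, so it never sorts (or even holds) the whole input; not quickselect proper, but the same end result for small N:

# streaming top-N: input lines are "<count> <pattern>", N passed via -v
awk -v N=3 '
{
    c = $1 + 0
    if (n < N || c > cnt[n]) {                # only bother if it can make the top N
        pos = (n < N) ? ++n : n               # grow the buffer until it holds N entries
        while (pos > 1 && cnt[pos-1] < c) {   # shift smaller entries down one slot
            cnt[pos] = cnt[pos-1]; val[pos] = val[pos-1]; pos--
        }
        cnt[pos] = c; val[pos] = $0
    }
}
END { for (i = 1; i <= n; i++) print val[i] }'

Fed the "tasty oranges" sample above it should print exactly the three desired lines; each input line costs at most N shifts, so it is O(lines*N) in the worst case (effectively linear for small N) and it only ever keeps N lines in memory.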
OTHER OBSTACLES: I also ran into several problems when working with large data sets:
- the log files sit on NFS-mounted volumes of ~100 servers, so it made sense to parallelize and split the work into smaller chunks.
- the awk '{S[$0]+=1}...' above needs a lot of memory - I see it dying once it has eaten up to 16 GB (despite 48 GB of free RAM and plenty of swap... maybe some linux limit I forgot about) - a possible sharding workaround is sketched right after this list.
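For what it's worth, one (untested) idea to work around the awk memory blow-up would be to shard the normalized lines into K bucket files via a cheap hash, aggregate each bucket with its own awk invocation, and only then merge - each table then only holds roughly 1/K of the distinct keys (K=16 and the bucket.* file names are arbitrary):

# shard stdin into K bucket files by a simple per-line hash
K=16
awk -v K=$K '
BEGIN { for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }  # char -> code lookup
{
    h = 0
    for (j = 1; j <= length($0); j++)
        h = (h * 31 + ord[substr($0, j, 1)]) % K
    print > ("bucket." h)       # a given line always lands in the same bucket
}'
# aggregate each (much smaller) bucket independently, then merge
for b in bucket.*; do
    awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}' "$b"
done|sort -rn|head -$N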
My current solution is still unreliable and suboptimal (work in progress) and looks as follows:
find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# TODO: reorder 'find' output to round-robin through srv1 srv2 ...
#       to help 'parallel' work with multiple servers at once
parallel -P20 $"zgrep -Po '[my pattern-grep regexp]' {}\
|awk '{S[\$0]+=1} END{for(i in S)if(S[i]>4)print \"count: \"S[i]\"\\n\"i}'"|\
# I throw away patterns met less than 5 times per log file
# in the hope those won't pop up on top of the result list anyway - bogus,
# but it helps to address the 16GB-mem problem for the 'awk' below
awk '{if("count:"==$1){C=$2}else{S[$0]+=C}} END{for(i in S)if(S[i]>99)print S[i]"\t"i}'|\
# I also skip all patterns met less than 100 times
# - the hope that these won't end up on top of the list is quite reliable
sort -rn|head -$N
# the line above is the inefficient one I am trying to address
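And regarding the TODO above - a possible (untested) way to round-robin the 'find' output across servers, assuming the srvNNN directory is the 4th '/'-separated component of each path:

find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# prefix each path with its per-server sequence number ($4 = srvNNN)
awk -F/ '{print ++seen[$4]"\t"$0}'|\
# numeric sort on that prefix interleaves the servers (1st file of each
# server, then 2nd of each, ...); cut drops the helper column again
sort -n|cut -f2-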