Best way to determine uniqueness and repetition in a text file

I have a text file of about 20 million lines. Each line is 25 characters long. My guess is that there are probably only about 200,000 to 300,000 unique lines. I want to find out exactly how many unique lines there are, and how many occurrences of each line there are (I expect the distribution of counts to be roughly exponential).

I could do this:

sort bigfile|uniq -c |sort -nr > uniqcounts
wc -l uniqcounts

but that is horribly inefficient in terms of memory and time.

What is your best command line solution for this problem?

+3
5 answers

Perl to the rescue, since Perl is for all practical purposes part of Unix. (Much the same could be done with awk, which I know less well.)

Something like this should do the trick:

#!/usr/bin/perl
use strict;
use warnings;

# Count how many times each distinct line occurs.
my %lines;
while (<>) {
    chomp;
    $lines{$_}++;
}

# Report the number of unique lines, then each line with its count,
# most frequent first.
print "Total unique lines: ", scalar(keys %lines), "\n";
foreach my $line (sort { $lines{$b} <=> $lines{$a} } keys %lines) {
    printf "%6d  %s\n", $lines{$line}, $line;
}

(This could be compressed into a one-liner, but spelled out it is easier to read.)

This takes O(n) memory, where n is the number of unique lines. The run time is O(n) for reading the file and updating the hash, plus O(n * log n) for sorting the unique lines at the end; only a hash degenerating to O(n^2) would be worse, which is very unlikely with data like this.
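Saved as, say, count_unique.pl (the file name is just for illustration), it runs directly against the file, and the first output line already gives the unique count:

perl count_unique.pl bigfile > uniqcounts
head -1 uniqcounts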

+6

I am not sure you will find anything fundamentally better, but let us look at the size of the data first.

20 million lines * 25 characters = 500,000,000 bytes (assuming single-byte characters, i.e. no Unicode)

That is about 500 MB of raw data. If it fits in RAM, sort and uniq will cope reasonably well; if it does not, the external sort is what makes the pipeline slow.

Another option is to load the data into a database (sqlite, for example) with a table along the lines of:

CREATE TABLE lines (line VARCHAR(25), occurences INTEGER)

insert or update a row for each line read, and then query the counts back out of the table.

That is probably overkill for a one-off job, but good luck!
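A minimal sketch of that idea using the sqlite3 command-line shell (database, table, and file names are illustrative; it assumes the input lines contain no '|' characters, since that is the default .import separator):

sqlite3 counts.db <<'EOF'
CREATE TABLE lines (line VARCHAR(25), occurences INTEGER);
CREATE TABLE raw (line TEXT);
.import bigfile raw
INSERT INTO lines SELECT line, COUNT(*) FROM raw GROUP BY line;
SELECT COUNT(*) AS unique_lines FROM lines;
SELECT * FROM lines ORDER BY occurences DESC LIMIT 20;
EOF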

+2

If you do stick with sort and uniq, set the locale first:

export LC_ALL=C

Comparing bytes in the C locale is much faster than locale-aware collation, although the perl solution will probably still be faster.
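Combined with the pipeline from the question, that becomes:

export LC_ALL=C
sort bigfile | uniq -c | sort -nr > uniqcounts
wc -l uniqcounts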

+1

With awk (use nawk or /usr/xpg4/bin/awk on Solaris):

awk 'END {
  for (k in _)
    print k, _[k]
}
{ _[$0]++ }
' infile
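The output above is one unsorted "line count" pair per unique line. To get the count-first, most-frequent-first report from the question, a variant piped through sort works (the array name cnt is only illustrative):

awk '{ cnt[$0]++ } END { for (k in cnt) print cnt[k], k }' infile | sort -nr > uniqcounts
wc -l uniqcounts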
+1

Note that as long as you want the output ordered by count, the problem is inherently O(n log(n) + n): the final "sort -nr" over the unique lines contributes the n log(n) term, and reading every input line at least once contributes the + n.

If you can live without the report sorted by count (the last stage), a single counting pass over the file is enough. If you do need it, there is not much to be gained over sort and uniq.
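For example, if all that is needed is the number of unique lines, one hash pass with no sorting at all will do (a sketch in portable awk):

awk '!seen[$0]++ { n++ } END { print n, "unique lines" }' infile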

0

Source: https://habr.com/ru/post/1704749/

