Best way to determine uniqueness and repetition in a text file

I have a text file of about 20 million lines. Each line is 25 characters long. My guess is that there are probably only about 200,000 to 300,000 unique lines. I want to find out exactly how many unique lines there are, and how many occurrences of each line there are (I expect the distribution of counts to be roughly exponential).

I could do this:

sort bigfile|uniq -c |sort -nr > uniqcounts
wc -l uniqcounts

but that is horribly inefficient in terms of memory and time.

What is your best command line solution for this problem?

+3
5 answers

Perl to the rescue, since Perl is for all practical purposes part of Unix. (Much the same could be done with awk, which I know less well.)

Something like this should do the trick:

#!/usr/bin/perl
use strict;
use warnings;

# Count how many times each distinct line occurs.
my %lines;
while (<>) {
    chomp;
    $lines{$_}++;
}

# Report the number of unique lines, then each line with its count,
# most frequent first.
print "Total unique lines: ", scalar(keys %lines), "\n";
foreach my $line (sort { $lines{$b} <=> $lines{$a} } keys %lines) {
    printf "%6d  %s\n", $lines{$line}, $line;
}

(This could be compressed into a one-liner, but spelled out it is easier to read.)

This takes O(n) memory, where n is the number of unique lines. The run time is O(n) for reading the file and updating the hash, plus O(n * log n) for sorting the unique lines at the end; only a hash degenerating to O(n^2) would be worse, which is very unlikely with data like this.
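Saved as, say, count_unique.pl (the file name is just for illustration), it runs directly against the file, and the first output line already gives the unique count:

perl count_unique.pl bigfile > uniqcounts
head -1 uniqcounts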

+6

I am not sure you will find anything fundamentally better, but let us look at the size of the data first.

20 million lines * 25 characters = 500,000,000 bytes (assuming single-byte characters, i.e. no Unicode)

That is about 500 MB of raw data. If it fits in RAM, sort and uniq will cope reasonably well; if it does not, the external sort is what makes the pipeline slow.

Another option is to load the data into a database (sqlite, for example) with a table along the lines of:

CREATE TABLE lines (line VARCHAR(25), occurences INTEGER)

insert or update a row for each line read, and then query the counts back out of the table.

That is probably overkill for a one-off job, but good luck!
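A minimal sketch of that idea using the sqlite3 command-line shell (database, table, and file names are illustrative; it assumes the input lines contain no '|' characters, since that is the default .import separator):

sqlite3 counts.db <<'EOF'
CREATE TABLE lines (line VARCHAR(25), occurences INTEGER);
CREATE TABLE raw (line TEXT);
.import bigfile raw
INSERT INTO lines SELECT line, COUNT(*) FROM raw GROUP BY line;
SELECT COUNT(*) AS unique_lines FROM lines;
SELECT * FROM lines ORDER BY occurences DESC LIMIT 20;
EOF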

+2

If you do stick with sort and uniq, set the locale first:

export LC_ALL=C

Comparing bytes in the C locale is much faster than locale-aware collation, although the perl solution will probably still be faster.
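Combined with the pipeline from the question, that becomes:

export LC_ALL=C
sort bigfile | uniq -c | sort -nr > uniqcounts
wc -l uniqcounts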

+1

With awk (use nawk or /usr/xpg4/bin/awk on Solaris):

awk 'END {
  for (k in _)
    print k, _[k]
}
{ _[$0]++ }
' infile
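The output above is one unsorted "line count" pair per unique line. To get the count-first, most-frequent-first report from the question, a variant piped through sort works (the array name cnt is only illustrative):

awk '{ cnt[$0]++ } END { for (k in cnt) print cnt[k], k }' infile | sort -nr > uniqcounts
wc -l uniqcounts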
+1

Note that as long as you want the output ordered by count, the problem is inherently O(n log(n) + n): the final "sort -nr" over the unique lines contributes the n log(n) term, and reading every input line at least once contributes the + n.

If you can live without the report sorted by count (the last stage), a single counting pass over the file is enough. If you do need it, there is not much to be gained over sort and uniq.
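For example, if all that is needed is the number of unique lines, one hash pass with no sorting at all will do (a sketch in portable awk):

awk '!seen[$0]++ { n++ } END { print n, "unique lines" }' infile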

0

Source: https://habr.com/ru/post/1704749/

