How can I count the unique terms in a plaintext file?

It can be in any high-level language that is likely to be available on a typical Unix-like system (Python, Perl, awk, standard Unix utilities such as sort and uniq, etc.). Hopefully it is fast enough to report the total number of unique terms for a 2 MB text file.

I only need this for a quick health check, so it does not need to be well designed.

Remember: it should not be case sensitive.

Thanks a lot, guys.

Note: if you use Python, do not write Python 3 code. The system I run it on only has version 2.4.4.

+2
8 answers

In Python 2.4 (it may work with earlier systems too):

 #! /usr/bin/python2.4
 import sys

 h = set()
 for line in sys.stdin.xreadlines():
     for term in line.split():
         h.add(term)
 print len(h)

In Perl:

 $ perl -ne 'for (split(" ", $_)) { $H{$_} = 1 } END { print scalar(keys%H), "\n" }' <file.txt 
+4

In Perl:

 my %words;
 while (<>) {
     map { $words{lc $_} = 1 } split /\s/;
 }
 print scalar keys %words, "\n";
+6

Using bash / UNIX Commands:

 sed -e 's/[[:space:]]\+/\n/g' $FILE | sort -fu | wc -l 
+5

Using standard Unix utilities:

 < somefile tr 'A-Z[:blank:][:punct:]' 'a-z\n' | sort | uniq -c 

If you are on a system without GNU tr, you will need to replace "[:blank:][:punct:]" with an explicit list of all the whitespace and punctuation characters that you want to treat as word delimiters rather than as part of a word, for example " \t.,;".

If you want the result to be sorted in decreasing order of frequency, you can add " | sort -r -n " to the end of it.

Note that this will also produce a spurious entry for the empty "words" created by leading delimiter characters; if that bothers you, you can use sed after the tr to filter out the blank lines.
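For instance, a rough sketch along these lines: the example delimiter list " \t.,;" is written out explicitly so that both tr sets have the same length, sed drops the blank lines, and the pipeline ends with a count of unique terms rather than per-word frequencies:

 # Sketch only: space, tab, period, comma and semicolon are the delimiters here;
 # the second set repeats \n so it is exactly as long as the first.
 < somefile tr 'A-Z \t.,;' 'a-z\n\n\n\n\n' | sed '/^$/d' | sort -u | wc -l 

Extend the delimiter list to whatever you consider a word separator in your data.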

+4

Here is a Perl one-liner:

 perl -lne '$h{lc $_}++ for split /[\s.,]+/; END{print scalar keys %h}' file.txt 

Or, to print a count for each term:

 perl -lne '$h{lc $_}++ for split /[\s.,]+/; END{printf "%-12s %d\n", $_, $h{$_} for sort keys %h}' file.txt 

It makes an attempt to handle punctuation, so that "foo." is counted as "foo" while "don't" is treated as a single word, but you can adjust the regular expression to suit your needs.
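For example, a sketch of one such adjustment: the character class also splits on a few more punctuation marks, apostrophes are deliberately left out so contractions survive, and any empty fields produced by leading punctuation are skipped:

 # Sketch only: extend or trim the delimiter class to taste.
 perl -lne '$h{lc $_}++ for grep length, split /[\s.,;:!?"()]+/; END{print scalar keys %h}' file.txt 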

+4

Simple (52 characters):

 perl -nE'@w{map lc,split/\W+/}=();END{say 0+keys%w}' 

For older versions of perl (55 characters):

 perl -lne'@w{map lc,split/\W+/}=();END{print 0+keys%w}' 
+3

Shorter version in Python:

 print len(set(w.lower() for w in open('filename.dat').read().split())) 

It reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, builds a set of the (unique) lowercased words, counts them, and prints the result.

It can also be run as a one-liner:

 python -c "print len(set(w.lower() for w in open('filename.dat').read().split()))" 
+3

Here is an awk one-liner:

 $ gawk -v RS='[[:space:]]' 'NF&&!a[toupper($0)]++{i++}END{print i}' somefile 
  • NF is true only when the record actually contains a word, so the empty records produced by consecutive whitespace are skipped.
  • !a[toupper($0)]++ is true only the first time a word is seen (case-insensitively), so each unique word is counted once.
0

Source: https://habr.com/ru/post/889886/

