How can I count the unique terms in a plaintext file?

It can be in any high-level language that is likely to be available on a typical Unix-like system (Python, Perl, awk, standard Unix utilities such as sort and uniq, etc.). Hopefully it is fast enough to report the total number of unique terms for a 2 MB text file.

I only need this for a quick health check, so it does not need to be well designed.

Remember: it should not be case sensitive.

Thanks a lot, guys.

Note: if you use Python, do not write Python 3 code. The system I run it on only has version 2.4.4.

+2
8 answers

In Python 2.4 (it may work with earlier systems too):

 #! /usr/bin/python2.4
 import sys

 h = set()
 for line in sys.stdin.xreadlines():
     for term in line.split():
         h.add(term)
 print len(h)

In Perl:

 $ perl -ne 'for (split(" ", $_)) { $H{$_} = 1 } END { print scalar(keys%H), "\n" }' <file.txt 
+4

In Perl:

 my %words;
 while (<>) {
     map { $words{lc $_} = 1 } split /\s/;
 }
 print scalar keys %words, "\n";
+6

Using bash / UNIX Commands:

 sed -e 's/[[:space:]]\+/\n/g' $FILE | sort -fu | wc -l 
+5

Using standard Unix utilities:

 < somefile tr 'A-Z[:blank:][:punct:]' 'a-z\n' | sort | uniq -c 

If you are on a system without GNU tr, you will need to replace "[:blank:][:punct:]" with an explicit list of all the whitespace and punctuation characters that you want to treat as word delimiters rather than as part of a word, for example " \t.,;".

If you want the result to be sorted in decreasing order of frequency, you can add " | sort -r -n " to the end of it.

Note that this will also produce a spurious entry for the empty "words" created by leading delimiter characters; if that bothers you, you can use sed after the tr to filter out the blank lines.
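For instance, a rough sketch along these lines: the example delimiter list " \t.,;" is written out explicitly so that both tr sets have the same length, sed drops the blank lines, and the pipeline ends with a count of unique terms rather than per-word frequencies:

 # Sketch only: space, tab, period, comma and semicolon are the delimiters here;
 # the second set repeats \n so it is exactly as long as the first.
 < somefile tr 'A-Z \t.,;' 'a-z\n\n\n\n\n' | sed '/^$/d' | sort -u | wc -l 

Extend the delimiter list to whatever you consider a word separator in your data.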

+4

Here is a Perl one-liner:

 perl -lne '$h{lc $_}++ for split /[\s.,]+/; END{print scalar keys %h}' file.txt 

Or, to print a count for each term:

 perl -lne '$h{lc $_}++ for split /[\s.,]+/; END{printf "%-12s %d\n", $_, $h{$_} for sort keys %h}' file.txt 

It makes an attempt to handle punctuation, so that "foo." is counted as "foo" while "don't" is treated as a single word, but you can adjust the regular expression to suit your needs.
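For example, a sketch of one such adjustment: the character class also splits on a few more punctuation marks, apostrophes are deliberately left out so contractions survive, and any empty fields produced by leading punctuation are skipped:

 # Sketch only: extend or trim the delimiter class to taste.
 perl -lne '$h{lc $_}++ for grep length, split /[\s.,;:!?"()]+/; END{print scalar keys %h}' file.txt 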

+4

Simple (52 characters):

 perl -nE'@w{map lc,split/\W+/}=();END{say 0+keys%w}' 

For older versions of perl (55 characters):

 perl -lne'@w{map lc,split/\W+/}=();END{print 0+keys%w}' 
+3

Shorter version in Python:

 print len(set(w.lower() for w in open('filename.dat').read().split())) 

It reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, builds a set of the (unique) lowercased words, counts them, and prints the result.

It can also be run as a one-liner:

 python -c "print len(set(w.lower() for w in open('filename.dat').read().split()))" 
+3

Here is an awk one-liner:

 $ gawk -v RS='[[:space:]]' 'NF&&!a[toupper($0)]++{i++}END{print i}' somefile 
  • NF is true only when the record actually contains a word, so the empty records produced by consecutive whitespace are skipped.
  • !a[toupper($0)]++ is true only the first time a word is seen (case-insensitively), so each unique word is counted once.
0

Source: https://habr.com/ru/post/889886/

