How to generate a list of (unique) words from a text file in Ubuntu?

I have an ASCII text file. I want to generate a list of all the "words" in this file using one or more Ubuntu commands. A word is defined as an alphanumeric sequence between delimiters. The delimiters are whitespace by default, but I also want to experiment with other characters such as punctuation. In other words, I want to specify a set of delimiter characters. How can I produce only the unique set of words? What if I also want to list only those words whose length is at least N characters?

3 answers

You can use grep:

-E '\w+' searches for words
-o prints only the part of the line that matches

    % cat temp
    Some examples use "The quick brown fox jumped over the lazy dog"
    rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
    for example text.

If you don't care whether words are repeated:

    % grep -o -E '\w+' temp
    Some
    examples
    use
    The
    quick
    brown
    fox
    jumped
    over
    the
    lazy
    dog
    rather
    than
    Lorem
    ipsum
    dolor
    sit
    amet
    consectetur
    adipiscing
    elit
    for
    example
    text

If you want to print each word only once, ignoring case, you can use sort:

-u prints each word only once
-f tells sort to ignore case when comparing words

If you want each word only once:

    % grep -o -E '\w+' temp | sort -u -f
    adipiscing
    amet
    brown
    consectetur
    dog
    dolor
    elit
    example
    examples
    for
    fox
    ipsum
    jumped
    lazy
    Lorem
    over
    quick
    rather
    sit
    Some
    text
    than
    The
    use
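The question also asks for words of at least N characters. One way to get that with the same tools, offered as a sketch rather than as part of the original answer, is to put the minimum length into the regular expression. The function name unique_words_min_len is made up for this example, and \w is a GNU grep extension that also matches the underscore, so it is slightly broader than "alphanumeric".

```shell
# Hypothetical helper: print the unique words of at least N characters.
# Reads text on standard input; N is the first argument and defaults to 1.
unique_words_min_len() {
    n="${1:-1}"
    # \w{N,} matches runs of N or more word characters
    # (letters, digits and the underscore in GNU grep).
    grep -o -E "\\w{${n},}" | sort -u -f
}

# Example: keep only words with five or more characters.
printf '%s\n' 'The quick brown fox jumped over the lazy dog' | unique_words_min_len 5
```

To run it on the file from the answer, redirect the file into the function: unique_words_min_len 5 < temp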

You can also use the tr command:

    echo the quick brown fox jumped over the lazydog | tr -cs 'a-zA-Z0-9' '\n'
    the
    quick
    brown
    fox
    jumped
    over
    the
    lazydog

-c takes the complement of the specified characters; -s squeezes runs of the replacement character into a single one; 'a-zA-Z0-9' is the set of alphanumeric characters, and if you add a character to it, the input is no longer split on that character (see the next example); '\n' is the replacement character (newline).

    echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9-' '\n'
    the
    quick
    brown
    fox
    jumped
    over
    the
    lazy-dog

Because we added '-' to the set of non-delimiter characters, lazy-dog was printed as a single word. Without it, the output is:

    echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9' '\n'
    the
    quick
    brown
    fox
    jumped
    over
    the
    lazy
    dog

Summary for tr: any character that is not in the set given with -c acts as a delimiter. I hope this also solves your delimiter problem.
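If you would rather list the delimiters themselves than the characters to keep, tr can do that as well by dropping -c. The following is a minimal sketch; the function name split_on is made up, and it relies on GNU tr (the default on Ubuntu) padding the shorter second set with its last character. Because the argument is passed to tr as a set, a '-' between two characters is read as a range and a backslash starts an escape.

```shell
# Hypothetical helper: split standard input on an explicit set of
# delimiter characters, given as the first argument in tr set notation.
split_on() {
    # Translate every delimiter to a newline; -s then squeezes runs of
    # newlines so consecutive delimiters do not produce empty lines.
    tr -s "$1" '\n'
}

# Example: treat space, comma and semicolon as delimiters.
printf '%s\n' 'the quick,brown fox;jumped' | split_on ' ,;'
```

Piping the result through sort -u gives the unique words, as in the grep example.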


This should work for you; the pipeline reads the text from standard input:

    tr \ \\t\\v\\f\\r \\n | tr -s \\n | tr -dc a-zA-Z0-9\\n | LC_ALL=C sort | uniq

If you want words to be at least five characters long, pipe the output of tr through grep ..... (five dots, which match any line of five or more characters). If you want case insensitivity, put tr A-Z a-z somewhere in the pipeline before the sort.

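Note that LC_ALL=C is required for sort to work correctly: it makes sort compare plain byte values instead of using the current locale's collation rules.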

I would recommend reading the man pages for any commands here that you don't understand.
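Put together, the pipeline above with both optional steps (lower-casing and the five-character minimum) might look like the sketch below. The function name words_min5_lower is made up for illustration, and the tr arguments are written with quotes instead of backslash escapes, which should be equivalent. Note that this approach deletes punctuation instead of splitting on it, so lazy-dog becomes lazydog.

```shell
# Hypothetical wrapper around the pipeline above, with the optional
# lower-casing and length filter included. Reads text on standard input.
words_min5_lower() {
    tr ' \t\v\f\r' '\n' |          # turn whitespace into newlines
        tr -s '\n' |               # squeeze repeated newlines
        tr -dc 'a-zA-Z0-9\n' |     # delete everything except alphanumerics and newlines
        tr 'A-Z' 'a-z' |           # fold to lower case before sorting
        grep '.....' |             # keep lines of five or more characters
        LC_ALL=C sort | uniq
}

# Example run on a short sentence.
printf '%s\n' 'The lazy-dog, the Quick brown fox.' | words_min5_lower
```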


Here is my word-cloud-like chain; it counts how often each word occurs and lists the most frequent words first:

    cat myfile | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr

If you have a TeX file, replace cat with detex:

    detex myfile | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr
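For a long text the full frequency list can be large. As a small variation on the chain above (a sketch; the function name top_words is made up), head can cut the list down to the N most frequent words:

```shell
# Hypothetical helper: print the N most frequent words (default 10),
# lower-cased, each preceded by its count. Reads text on standard input.
top_words() {
    grep -o -E '\w+' |
        tr 'A-Z' 'a-z' |           # fold case so "The" and "the" count together
        sort | uniq -c |           # count identical adjacent lines
        sort -nr |                 # highest counts first
        head -n "${1:-10}"
}

# Example: the two most frequent words of a short input.
printf '%s\n' 'a b a c a b' | top_words 2
```

On a file this would be called as top_words 20 < myfile, which also avoids the extra cat process.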


Source: https://habr.com/ru/post/1480163/

