Using awk
Try the following command:
awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words
As an example, consider this file:
$ cat novel
It was a "dark" and stormy night; the rain fell in torrents.
$ awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel
It
was
a
dark
and
stormy
night
the
rain
fell
in
torrents
Or, to save the output in the file words, use:
awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words
How it works:
gsub(/[[:punct:]]/, "")
This tells awk to find any punctuation and replace it with an empty string.
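[:punct:] is a character class that matches punctuation characters. Which characters it covers depends on the current locale: in a UTF-8 locale, with an awk that handles multibyte characters (GNU awk, for example), it covers the punctuation defined by Unicode, including the many kinds of quotation mark. In the C locale it covers only ASCII punctuation.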
1
This is awk's cryptic shorthand for "print the record".
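To see the shorthand in action, the sketch below compares a bare 1 pattern with an explicit {print}. The two-line sample input is made up for illustration.

```shell
# Made-up two-line sample input.
sample=$(printf 'first line\nsecond line\n')

# A pattern with no action: 1 is always true, so the default action (print) runs.
with_one=$(printf '%s\n' "$sample" | awk '1')

# The same thing written out explicitly.
with_print=$(printf '%s\n' "$sample" | awk '{print}')

printf '%s\n' "$with_one"
```

Both commands should print the input unchanged.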
RS='[[:space:]]'
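This tells awk to use any whitespace character as the record separator. Each word therefore becomes a separate record, and awk reads in and processes one word at a time. (A regular expression as RS works in GNU awk and mawk; POSIX only specifies the behavior for a single-character RS.)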
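One consequence worth knowing: [[:space:]] matches a single whitespace character, so two spaces in a row produce an empty record between them. A minimal sketch, assuming an awk that accepts a regular expression as RS; the sample text and the `+` variant are illustrations, not part of the original command:

```shell
# Made-up input with two spaces between the words.
text='It  was'

# One separator per whitespace character: the double space yields an empty record.
single=$(printf '%s\n' "$text" | awk '1' RS='[[:space:]]')

# One separator per run of whitespace: no empty record.
runs=$(printf '%s\n' "$text" | awk '1' RS='[[:space:]]+')

printf 'single:\n%s\nruns:\n%s\n' "$single" "$runs"
```

If the text may contain runs of whitespace and blank lines in the output are unwanted, the `+` form is probably the one to use.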
Word count
The usual approach for counting elements in Unix is to use sort and uniq -c as follows:
$ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]' | sort | uniq -c
      1 one
      3 three
      2 two
Alternatively awk can do everything:
$ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, ""); a[$0]++} END{for (w in a) print w,a[w]}' RS='[[:space:]]'
three 3
two 2
one 1
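Note that awk does not guarantee any particular order for for (w in a), so the order of the lines can differ between awk implementations. If a stable, most-frequent-first listing is wanted, one option is to pipe the result through sort. This is a sketch under that assumption; the sort keys are my choice, not part of the original answer:

```shell
counts=$(
  echo 'one two two three three three' |
    awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, ""); a[$0]++}
         END {for (w in a) print w, a[w]}' RS='[[:space:]]' |
    sort -k2,2nr -k1,1   # count descending, then word ascending
)
printf '%s\n' "$counts"
```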
Alternative awk method
Andrei Makukha suggests that we might not want to remove punctuation from inside a word, like the single quote in I've. Similarly, we might not want to remove the periods from a URL, so that google.com stays google.com. To remove punctuation only at the beginning or end of a word, replace the gsub command with:
gsub(/^[[:punct:]]|[[:punct:]]$/, "")
For instance:
$ echo "I've got 'google.com'" | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]'
I've
got
google.com
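The pattern above removes at most one punctuation character from each end of a word, so a word such as "dark", (closing quote followed by a comma) would keep its closing quote. A possible variant, offered as a sketch rather than as part of the original answer, adds `+` so that a whole run of punctuation is stripped from each end; the sample text is made up:

```shell
# Made-up sample with more than one punctuation character at the word edges.
sample="\"dark\", it's (stormy)..."

stripped=$(
  printf '%s\n' "$sample" |
    awk '{gsub(/^[[:punct:]]+|[[:punct:]]+$/, "")} 1' RS='[[:space:]]'
)
printf '%s\n' "$stripped"
```

Punctuation inside a word, like the apostrophe in it's, is still left alone.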
Using sed
This sed command will remove all punctuation and put each word on a separate line:
sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel
If we run this command on our novel file, we get:
$ sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel
It
was
a
dark
and
stormy
night
the
rain
fell
in
torrents
If you want to save the words in the file words, try:
sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel >words
How it works:
s/[[:punct:]]//g
This tells sed to find every punctuation character and replace it with nothing. Again, we use [:punct:] because, in a UTF-8 locale, it handles the punctuation characters defined in Unicode.
s/[[:space:]]/\n/g
This tells sed to replace each whitespace character with a newline. (Writing \n in the replacement is supported by GNU sed; other sed implementations may not accept it.)
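Since [[:space:]] matches one character at a time, a run of several spaces or tabs turns into several newlines, which leaves blank lines in the output. A sketch of one way to avoid that, assuming GNU sed (for \n in the replacement); the sample input is made up:

```shell
# Made-up input containing a double space and a tab.
words=$(
  printf 'It  was a\tdark\n' |
    sed 's/[[:punct:]]//g; s/[[:space:]]\{1,\}/\n/g'   # one newline per run of whitespace
)
printf '%s\n' "$words"
```

sed reads its input line by line, so newlines already present in the file are never seen by the substitution and keep separating words as they are.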