How to remove punctuation marks using awk?

Question


I need a shell command that, given the text file “novel”, prints each word on its own line together with the number of the line it appears on, and writes the result to a file called “words”. The words must not contain punctuation marks. This is what I have so far:

$ awk '{for(i=1; i<=NF; ++i) {printf $i "\t" NR "\n", $0 > "words"}}' novel

The file contains:

 $ cat novel
 ver a don Quijote, y ellas le defendían la puerta:
 -¿Qué quiere este mostrenco en esta casa?

Expected Result:

 ver 1
 a 1
 don 1
 Quijote 1
 ...
 puerta 1
 Qué 2
 ...
 casa 2

This is a very simple command for academic use. Thanks in advance.

+5

unix shell awk

Alex martinez Feb 08 '18 at 5:10


2 answers

You can remove specific punctuation using the awk gsub function:

 awk '{ gsub(/["*^&()#@$,\.!?~;]/, ""); for (i=1; i<=NF; ++i) print $i "\t" NR > "words" }' novel

Further information on gsub can be found in the awk documentation.

In addition, you do not need printf $i "\t" NR "\n", $0: printf treats its first argument as the format string, so in most cases only that first part is printed and the trailing $0 is ignored. So I changed it to print, which appends the newline itself, and dropped the $0 argument.
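For the output the question asks for (each word next to the number of the line it came from), one possible sketch strips punctuation from each field and prints it with an explicit printf format. The function name words_with_lines is made up for illustration, and the [[:punct:]] class is used here instead of an explicit list, which assumes an awk that supports POSIX character classes (GNU awk, for example).

```shell
# Hypothetical helper: print "word<TAB>line number" for every word in the input.
words_with_lines() {
    awk '{
        for (i = 1; i <= NF; ++i) {
            word = $i
            gsub(/[[:punct:]]/, "", word)    # strip punctuation from this word only
            if (word != "")                  # skip tokens that were pure punctuation
                printf "%s\t%d\n", word, NR  # explicit format, so a "%" in the text is harmless
        }
    }' "$@"
}

# Usage (writes the result to the file "words"):
#   words_with_lines novel > words
```

Passing the word as an argument to a fixed format string avoids the problem with printf $i ..., where a word containing % would be interpreted as part of the format.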

0

Andriy makukha Feb 08 '18 at 5:21


John1024 · Accepted Answer · 2018-02-08T05:19:42+0000

Using awk

Try the following command:

 awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words

As an example, consider this file:

 $ cat novel
 It was a "dark" and stormy night; the rain fell in torrents.
 $ awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel
 It
 was
 a
 dark
 and
 stormy
 night
 the
 rain
 fell
 in
 torrents

Or, to save the output in a words file, use:

 awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words

How it works:

gsub(/[[:punct:]]/, "")
This tells awk to find any punctuation and replace it with an empty string.
[:punct:] is a character class that includes all punctuation marks. In a UTF-8 locale, this form includes all punctuation marks defined by Unicode. Unicode defines, for example, many types of quotation mark characters, and this form includes them all.
1
This is awk shorthand for "print the record".
RS='[[:space:]]'
This tells awk to use any whitespace character as the record separator. This means that each word becomes a separate record, and awk reads in one word at a time.
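One detail worth noting: [[:space:]] matches a single whitespace character, so consecutive blanks, blank lines, or a token made only of punctuation each produce an empty output line. A possible variant, sketched below, uses [[:space:]]+ as the separator and prints only non-empty records. It assumes an awk that accepts a regular expression as RS (GNU awk and mawk do; POSIX does not require it), and the function name one_word_per_line is hypothetical.

```shell
# Hypothetical helper: one punctuation-free word per line, no empty lines.
one_word_per_line() {
    # First rule strips punctuation; second rule prints the record only if something is left.
    awk '{ gsub(/[[:punct:]]/, "") } $0 != ""' RS='[[:space:]]+' "$@"
}

# Usage:
#   one_word_per_line novel > words
```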

Word count

The usual approach for counting elements in Unix is to use sort and uniq -c as follows:

 $ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]' | sort | uniq -c
       1 one
       3 three
       2 two

Alternatively awk can do everything:

 $ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, ""); a[$0]++} END{for (w in a) print w,a[w]}' RS='[[:space:]]'
 three 3
 two 2
 one 1
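The order in which for (w in a) visits the array is unspecified in awk, so the lines above may come out in a different order on another system. If a stable order matters, one option is to let awk do the counting and sort do the ordering. The sketch below prints the count first and sorts by descending count, then by word; the function name word_counts is an assumption for illustration.

```shell
# Hypothetical helper: count words, most frequent first, ties in alphabetical order.
word_counts() {
    awk '{ gsub(/^[[:punct:]]|[[:punct:]]$/, ""); if ($0 != "") a[$0]++ }
         END { for (w in a) print a[w], w }' RS='[[:space:]]' "$@" |
        sort -k1,1nr -k2,2
}

# Usage:
#   echo 'one two two three three three' | word_counts
```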

Alternative awk method

Andriy Makukha points out that we may not want to remove punctuation from inside a word, such as the apostrophe in I've. Similarly, we do not want to remove the periods from a URL, so that google.com remains google.com. To remove punctuation only at the beginning or end of a word, replace the gsub command with:

 gsub(/^[[:punct:]]|[[:punct:]]$/, "")

For instance:

 $ echo "I've got 'google.com'" | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]'
 I've
 got
 google.com
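Note that this pattern removes at most one punctuation character from each end, so a token such as "'google.com'," would keep some of its quotes. A possible refinement, sketched here under the same assumptions as the commands above (an awk with POSIX character classes and regex RS), adds + to strip whole runs of leading and trailing punctuation. The function name trim_edge_punct is hypothetical.

```shell
# Hypothetical helper: strip runs of punctuation at word edges, keep punctuation inside words.
trim_edge_punct() {
    awk '{ gsub(/^[[:punct:]]+|[[:punct:]]+$/, "") } $0 != ""' RS='[[:space:]]+' "$@"
}

# Usage:
#   trim_edge_punct novel > words
```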

Using sed

This sed command will remove all punctuation and put each word on a separate line:

 sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel

If we run this command, we get:

 $ sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel
 It
 was
 a
 dark
 and
 stormy
 night
 the
 rain
 fell
 in
 torrents

If you want to save the words in the words file, try:

 sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel >words

How it works:

s/[[:punct:]]//g
This tells sed to find any punctuation occurrence and replace it with nothing. Again, we use [:punct:] because it will handle all the punctuation characters defined in Unicode.
s/[[:space:]]/\n/g
This tells sed to find every whitespace character and replace it with a newline.
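As with the awk version, consecutive blanks or punctuation-only tokens leave empty lines in the output. A minimal sketch that collapses runs of whitespace and then drops empty lines is shown below. It assumes GNU sed, which interprets \n in the replacement as a newline (other sed implementations may not), and the function name sed_words is made up for illustration.

```shell
# Hypothetical helper: one punctuation-free word per line using sed (GNU sed assumed).
sed_words() {
    # 1st sed: delete punctuation, turn each run of whitespace into one newline.
    # 2nd sed: delete the empty lines that remain.
    sed 's/[[:punct:]]//g; s/[[:space:]]\{1,\}/\n/g' "$@" | sed '/^$/d'
}

# Usage:
#   sed_words novel > words
```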
