How to remove punctuation marks using awk?

I need a shell command that, given the text file "novel", prints each word on its own line together with the number of the line it appears on, and writes the result to a file called "words". The problem is that the words must not contain punctuation marks. This is what I have so far:

$ awk '{for(i=1; i<=NF; ++i) {printf $i "\t" NR "\n", $0 > "words"}}' novel 

The file contains:

 $ cat novel
 ver a don Quijote, y ellas le defendían la puerta:
 -¿Qué quiere este mostrenco en esta casa?

Expected Result:

 ver 1
 a 1
 don 1
 Quijote 1
 ...
 puerta 1
 Qué 2
 ...
 casa 2

This is a very simple command for academic use. Thanks in advance.

2 answers

Using awk

Try the following command:

 awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words 

As an example, consider this file:

 $ cat novel
 It was a "dark" and stormy night; the rain fell in torrents.
 $ awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel
 It
 was
 a
 dark
 and
 stormy
 night
 the
 rain
 fell
 in
 torrents

Or, to save the output in a words file, use:

 awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words 

How it works:

  • gsub(/[[:punct:]]/, "")

    This tells awk to find any punctuation and replace it with an empty string.

    [:punct:] is a character class that matches punctuation characters. In a multibyte-aware awk (gawk, for example) running in a UTF-8 locale, it is not limited to ASCII: it also covers punctuation defined by Unicode, such as the many kinds of quotation marks. A byte-oriented awk, or the C locale, will match ASCII punctuation only.

  • 1

    This is awk shorthand for print: a pattern that is always true, with no action attached, prints the current record.

  • RS='[[:space:]]'

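    This tells awk to use any whitespace character as the record separator. Each word therefore becomes a separate record, and awk reads and processes one word at a time. Because the pattern matches a single character, consecutive whitespace characters produce empty records; RS='[[:space:]]+' treats a whole run as one separator. A regular expression in RS is supported by gawk and mawk, but POSIX leaves the behavior of a multi-character RS unspecified.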
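With RS set to split on whitespace, each record is a word, so NR counts words rather than input lines, and the line numbers the question asks for are lost. A minimal sketch of one way to keep them, assuming the default record separator and a loop over the fields as in the question's own attempt; the function name words_with_lines and the sample text are hypothetical:

```shell
# Hypothetical helper (not from the answer): reads text on stdin and prints
# "word<TAB>line number" for every word, with punctuation removed.
words_with_lines() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = $i                          # work on a copy of the field
            gsub(/[[:punct:]]/, "", w)      # delete every punctuation character
            if (w != "") print w "\t" NR    # skip fields that were only punctuation
        }
    }'
}

# Sample input is an assumption; in practice: words_with_lines < novel > words
printf '%s\n' 'It was a "dark" night;' 'the rain fell.' | words_with_lines
```

Since the default RS is a newline, NR here is the input line number, and no regular-expression RS is needed.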

Word count

The usual approach for counting elements in Unix is to use sort and uniq -c as follows:

 $ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]' | sort | uniq -c
       1 one
       3 three
       2 two

Alternatively awk can do everything:

 $ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, ""); a[$0]++} END{for (w in a) print w,a[w]}' RS='[[:space:]]'
 three 3
 two 2
 one 1
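The order in which for (w in a) visits the array is unspecified, so the order of the lines in that output can differ between awk implementations. A sketch that gives a stable order by sorting on the count, assuming a loop over fields instead of a regular-expression RS; the name word_counts is hypothetical:

```shell
# Hypothetical helper: reads text on stdin and prints "count word" lines,
# most frequent first, ties broken by the word itself.
word_counts() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = $i
            gsub(/^[[:punct:]]+|[[:punct:]]+$/, "", w)  # trim punctuation at the edges
            if (w != "") count[w]++
        }
    }
    END { for (w in count) print count[w], w }' |
        sort -k1,1nr -k2,2
}

echo 'one two two three three three' | word_counts
```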

Alternative awk method

Andrei Makukha points out that we may not want to remove punctuation from inside a word, such as the apostrophe in I've. Similarly, we do not want to remove periods from a URL, so that google.com remains google.com. To remove punctuation only at the beginning or end of a word, replace the gsub command with:

 gsub(/^[[:punct:]]|[[:punct:]]$/, "") 

For instance:

 $ echo "I've got 'google.com'" | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]'
 I've
 got
 google.com
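The pattern above removes only one punctuation character from each end of a word, so a word such as "dark", would keep its closing quote. A sketch of a variant that trims whole runs by adding + to each alternative; the function name trim_edges and the sample input are hypothetical:

```shell
# Hypothetical helper: reads text on stdin and prints one word per line,
# trimming runs of punctuation at the start and end of each word only.
trim_edges() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = $i
            gsub(/^[[:punct:]]+|[[:punct:]]+$/, "", w)  # inner punctuation is kept
            if (w != "") print w
        }
    }'
}

printf '%s\n' "\"I've\" got ('google.com')..." | trim_edges
```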

Using sed

This sed command will remove all punctuation and put each word on a separate line:

 sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel 

If we run this command, we get:

 $ sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel
 It
 was
 a
 dark
 and
 stormy
 night
 the
 rain
 fell
 in
 torrents

If you want to save the words in the words file, try:

 sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel >words 

How it works:

  • s/[[:punct:]]//g

    This tells sed to find every punctuation character and replace it with nothing. Again, we use [:punct:] because, in a UTF-8 locale with a multibyte-aware sed, it is not limited to ASCII punctuation.

  • s/[[:space:]]/\n/g

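    This tells sed to replace every whitespace character with a newline. Each character is replaced individually, so a run of several spaces produces empty lines. Note that \n in the replacement text is understood by GNU sed; other sed implementations may not accept it.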
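Where \n in a sed replacement is not available, the same two steps can be sketched with tr, which also squeezes runs of whitespace into a single newline. This is an alternative technique, not the answer's method; the function name split_words and the sample text are hypothetical, and a byte-oriented tr may leave non-ASCII punctuation in place:

```shell
# Hypothetical helper: reads text on stdin and prints one word per line.
split_words() {
    # 1) delete punctuation; 2) turn whitespace into newlines, squeezing repeats
    tr -d '[:punct:]' | tr -s '[:space:]' '\n'
}

echo 'It was a "dark" and stormy night;  the rain fell.' | split_words
```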


You can remove specific punctuation using the awk gsub function:

 awk '{ gsub(/["*^&()#@$,\.!?~;]/, ""); for (i=1; i<=NF; ++i) print $i "\t" NR > "words" }' novel

Further information on this function can be found here.

In addition, you do not need printf $i "\t" NR "\n", $0. The first argument to printf is the format string, and since it contains no conversion specifiers the trailing $0 is never printed. So I changed it to print, which adds the newline itself, and dropped the $0 argument.
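If printf is preferred, a safer form passes the word as an argument to a fixed format string, so a word containing % is not interpreted as a conversion. A small sketch; the function name print_words and the sample input are hypothetical:

```shell
# Hypothetical helper: reads text on stdin and prints "word<TAB>line number"
# using an explicit printf format instead of the data as the format.
print_words() {
    awk '{ for (i = 1; i <= NF; i++) printf "%s\t%d\n", $i, NR }'
}

printf '%s\n' 'ver a don' '100% cierto' | print_words
```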


Source: https://habr.com/ru/post/1275230/

