How to remove duplicate words from a string in a bash script?

Question

I have a string containing duplicate words, for example:

abc, def, abc, def

How can I remove the duplicates? The result I need is:

 abc, def

+6

bash

Thanh tran May 18, '15 at 4:02

4 answers

You can use awk for this.

Example:

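 #!/bin/bash
 string="abc, def, abc, def"
 # Needs GNU awk: a regex RS and the RT variable are gawk extensions.
 # Print each word the first time it is seen, followed by its separator.
 string=$(printf '%s\n' "$string" | awk -v RS='[,[:space:]]+' '!a[$0]++{printf "%s%s", $0, RT}')
 # Strip only a trailing run of commas/spaces (left when the last word was a
 # duplicate); "${string%,*}" would also cut off a unique last word.
 string="${string%"${string##*[!, ]}"}"
 echo "$string"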

Output:

 abc, def
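
For systems without GNU awk, the same idea (print a word only the first time it is seen) can be sketched with plain POSIX awk features, splitting fields instead of records. This is only an illustration: the function name dedupe_awk and the array name seen are made up here, and it assumes the words are separated by a comma followed by optional spaces.

```bash
#!/bin/bash
# Hypothetical helper, not part of the answer above.
# Splits the line into fields on "," plus optional spaces, keeps each word
# the first time it appears, and joins the kept words with ", ".
dedupe_awk() {
    printf '%s\n' "$1" | awk -F', *' '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i != "" && !seen[$i]++)
                out = out (out == "" ? "" : ", ") $i
        print out
    }'
}

dedupe_awk "abc, def, abc, def"   # prints: abc, def
```

Because the separator is rebuilt while joining, there is no trailing ", " to strip afterwards.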

+3

Jahid May 18, '15 at 6:52

This can also be done in pure Bash:

 #!/bin/bash
 string="abc, def, abc, def"
 declare -A words
 IFS=", "
 for w in $string; do
   words+=( [$w]="" )
 done
 echo ${!words[@]}

Output

 def abc

Explanation

words is an associative array ( declare -A words ), and each word is added as a key to it:

 words+=( [${w}]="" )

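(The value is not needed, so I used the empty string "".)

The list of unique words is the list of keys ( ${!words[@]} ).

There is one caveat: the output is not separated by ", ". (You would have to loop over the keys again to build that. IFS is only used for joining with the quoted [*] form, e.g. "${!words[*]}", and then only the first character of IFS is used.) The keys of an associative array also come back in no particular order, as the output above shows.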
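
The extra loop mentioned in the caveat might look like the sketch below. It is not from the answer itself: the function name dedupe_words and its local variables are hypothetical, it builds the ", "-separated result while de-duplicating so that first-seen order is kept, and it assumes Bash 4 or later for associative arrays.

```bash
#!/bin/bash
# Hypothetical helper: de-duplicate a comma/space separated list,
# keep first-seen order, and join the result with ", ".
dedupe_words() {
    local input=$1 w result=""
    local -A seen=()
    local -a parts
    # Split on commas and spaces for this read only; the global IFS is untouched.
    IFS=', ' read -r -a parts <<< "$input"
    for w in "${parts[@]}"; do
        [[ -z $w ]] && continue            # skip empty fields such as in "a,,b"
        [[ -n ${seen[$w]} ]] && continue   # skip words already emitted
        seen[$w]=1
        result+="${result:+, }$w"          # add ", " only between words
    done
    printf '%s\n' "$result"
}

dedupe_words "abc, def, abc, def"   # prints: abc, def
```

Setting IFS only on the read command avoids changing word splitting for the rest of the script, which the plain IFS=", " assignment would do.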

+2

Micha wiedenmann May 18, '15 at 7:40

I found another way for this case. I changed my input to be space-separated, as below, and ran this pipeline on it (note that sort -u also sorts the words, so the original order is not kept):

 #string="abc def abc def"
 $ echo "abc def abc def" | xargs -n1 | sort -u | xargs | sed "s# #, #g"
 abc, def

Thanks for the support!

+1

Thanh tran May 19 '15 at 5:55

John1024 · Accepted Answer · 2015-05-18T04:55:40+0000

We have this test file:

 $ cat file
 abc, def, abc, def

To remove duplicate words:

 $ sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
 abc, def

How it works

:a
This defines the label a.
s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g
This looks for a duplicated word consisting of alphanumeric characters and removes the second occurrence.
ta
If the last substitution command made a change, this jumps back to label a to try again. In this way the code keeps looking for duplicates until none remain.
s/(, )+/, /g; s/, *$//
These two substitution commands clean up any leftover comma-space combinations.
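
To see why the two cleanup substitutions are needed, it can help to run the loop on its own and look at the intermediate line. A small sketch, assuming GNU sed; the variable names after_loop and cleaned are only for illustration:

```bash
#!/bin/bash
# Assumes GNU sed (for -r and \b).
line='abc, def, abc, def'

# Step 1: only the duplicate-removal loop. The separators that surrounded
# the removed words are still in the line.
after_loop=$(printf '%s\n' "$line" | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta')

# Step 2: collapse repeated ", " and drop the trailing one.
cleaned=$(printf '%s\n' "$after_loop" | sed -r 's/(, )+/, /g; s/, *$//')

# Brackets make the trailing space visible.
printf '[%s]\n' "$after_loop"   # prints: [abc, def, , ]
printf '[%s]\n' "$cleaned"      # prints: [abc, def]
```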

Mac OSX or another BSD system

For Mac OSX or another BSD system, try:

 sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file

Using a string instead of a file

sed can just as easily read its input from a file, as shown above, or from a shell string through a pipe, as shown below:

 $ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
 ab, cd, ef
