Removing duplicate rows based on partial string comparison

I have a text file containing thousands of lines of text, as shown below.

 123 hello world
 124 foo bar
 125 hello world

I would like to detect duplicates by comparing only part of each line — the text after the leading number. For the above, it should output:

 123 hello world
 124 foo bar

Is there a vim command that can do this?

Update: I am on a Windows machine, so I cannot use uniq.

4 answers

This is the bash command:

 sort -k2 input | uniq -s4 
  • sort -k2 starts the sort key at the second field, so lines are ordered by the text after the number
  • uniq -s4 skips the first 4 characters when comparing adjacent lines
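As a quick sanity check, the pipeline can be run on the sample data from the question (a sketch; the file name input matches the command above):

```shell
# Recreate the sample file from the question.
printf '123 hello world\n124 foo bar\n125 hello world\n' > input

# Sort by the second field onward, then drop adjacent lines whose
# text after the first 4 characters (number + space) repeats.
sort -k2 input | uniq -s4
# → 124 foo bar
#   123 hello world
```

Note that -s4 only works here because every line begins with a 3-digit number followed by a space; if the numbers varied in width, uniq -f 1 (skip one whole field) would be safer.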

In vim, you can invoke the external command above:

 :%!sort -k2 % | uniq -s4 
  • the % after sort expands to the current file name; :%! replaces the whole buffer with the command's output.

Alternatively, you can sort within vim itself with this command:

 :sort /^\d*\s/ 
  • vim skips the text matched by the pattern (the leading number and whitespace) and sorts by the rest of the line

After sorting, use this command to remove duplicate rows:

 :%s/\v(^\d*\s(.*)$\n)(^\d*\s\2$\n)+/\1/ 
  • To avoid too many backslashes, I use \v in the pattern to enable very magic mode.
  • In a multi-line pattern, $ matches the position just before the newline character ( \n ); it is not strictly necessary here.
  • You can adapt the regular expression to your own needs.

Using awk:

 $ awk '!a[$2$3]++' file
 123 hello world
 124 foo bar

The first occurrence of a key finds a count of 0 in the array, so the negation is true and the line is printed; the post-increment then sets the count to 1. Every later occurrence finds a non-zero count, the negation is false, and the line is suppressed.
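The one-liner above keys on fields 2 and 3 only, so lines with more than three fields could collide. A variant that keys on everything after the first field may be safer (a sketch; the key variable and the sub() call are my addition, not part of the original answer):

```shell
# Recreate the sample file from the question.
printf '123 hello world\n124 foo bar\n125 hello world\n' > file

# Strip the leading number from a copy of each line and use the
# remainder as the dedup key; print only first occurrences.
awk '{key = $0; sub(/^[0-9]+ /, "", key)} !a[key]++' file
# → 123 hello world
#   124 foo bar
```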


In Vim, I was able to sort and delete duplicates with the following command:

 :sort u 

I'm not sure about vim, but you can do something with the uniq command. It has a --skip-fields argument, which can be used to ignore the first field of each line when comparing.

 $ cat test.txt
 123 hello world
 124 foo bar
 125 hello world
 $ cat test.txt | sort -k 2 | uniq --skip-fields=1 | sort
 123 hello world
 124 foo bar
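The session above can be reproduced as a script; -f 1 is the portable short form of --skip-fields=1 (a sketch using the sample data from the question):

```shell
# Recreate test.txt from the question.
printf '123 hello world\n124 foo bar\n125 hello world\n' > test.txt

# Sort by the second field, drop duplicates ignoring the first
# field, then restore order by the leading number.
sort -k 2 test.txt | uniq -f 1 | sort
# → 123 hello world
#   124 foo bar
```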

Source: https://habr.com/ru/post/1444385/

