Delete duplicate lines without sorting

I have a text file of about 5000 lines. I need to delete duplicate lines (except those containing the words "Niveau" or "stime"), keeping the first occurrence and without sorting. The text looks like this:

  vide vide Time: stime 3:30 PM vide vide NN NN NP stime LS NP NN NN
  ----------Niveau 1--------------
  Time: | 0 | 263.0 | 266.0 | 0,0113
  NP | 0 | 0.0 | 24885.0 | 1
  3:30 | -0 | 104.0 | 120.0 | 0,1333
  LS | -0 | 0.0 | 13134.0 | 1
  PM | -1 | 134.0 | 238.0 | 0,437
  NP | -1 | 0.0 | 24885.0 | 1
  ----------Niveau 2--------------
  3:30 PM | -0 | 30.0 | 41.0 | 0,2683
  3:30 NP | -0 | 133.0 | 55.0 | -1,4182
  LS PM | -0 | 42.0 | 237.0 | 0,8228
  LS NP | -0 | 0.0 | 2456.0 | 1
  ----------Niveau 3--------------
  vide vide Time: stime 3:30 pm vide vide NN NN NP stime LS NN NN NN
  ----------Niveau 1--------------
  Time: | 0 | 263.0 | 266.0 | 0,0113
  NP | 0 | 0.0 | 24885.0 | 1
  3:30 | -0 | 104.0 | 120.0 | 0,1333
  LS | -0 | 0.0 | 13134.0 | 1
  pm | -1 | 38.0 | 54.0 | 0,2963
  NN | -1 | 0.0 | 59511.0 | 1
  ----------Niveau 2--------------
  3:30 pm | -0 | 9.0 | 9.0 | 0
  3:30 NN | -0 | 36.0 | 24.0 | -0,5
  LS pm | -0 | 22.0 | 52.0 | 0,5769
  LS NN | -0 | 0.0 | 2658.0 | 1
  ----------Niveau 3--------------

Expected results:

  vide vide Time: stime 3:30 PM vide vide NN NN NP stime LS NP NN NN
  ----------Niveau 1--------------
  Time: | 0 | 263.0 | 266.0 | 0,0113
  NP | 0 | 0.0 | 24885.0 | 1
  3:30 | -0 | 104.0 | 120.0 | 0,1333
  LS | -0 | 0.0 | 13134.0 | 1
  PM | -1 | 134.0 | 238.0 | 0,437
  NP | -1 | 0.0 | 24885.0 | 1
  ----------Niveau 2--------------
  3:30 PM | -0 | 30.0 | 41.0 | 0,2683
  3:30 NP | -0 | 133.0 | 55.0 | -1,4182
  LS PM | -0 | 42.0 | 237.0 | 0,8228
  LS NP | -0 | 0.0 | 2456.0 | 1
  ----------Niveau 3--------------
  vide vide Time: stime 3:30 pm vide vide NN NN NP stime LS NN NN NN
  ----------Niveau 1--------------
  pm | -1 | 38.0 | 54.0 | 0,2963
  NN | -1 | 0.0 | 59511.0 | 1
  ----------Niveau 2--------------
  3:30 pm | -0 | 9.0 | 9.0 | 0
  3:30 NN | -0 | 36.0 | 24.0 | -0,5
  LS pm | -0 | 22.0 | 52.0 | 0,5769
  LS NN | -0 | 0.0 | 2658.0 | 1
  ----------Niveau 3--------------

Using Notepad++ with the TextFX plugin, I hide the lines containing the words "Niveau" and "stime", and then use the regex ^(.*?)$\s+?^(?=.*^\1$) in the Search and Replace dialog, as suggested in the second answer to this post. But when I click "Replace All", it deletes all the lines and I am left with empty text. Am I doing something wrong?

+5
3 answers

You need scripting ability, because there is no way to delete a duplicate line without advancing the match position past that line.

Therefore, you have to sit in a loop, restarting from the beginning until all duplicates are deleted.

Perl example:

  while ( $str =~ s/regex/$1/g ) {}

It can be done. It may take a little extra time, but it is doable.

Anyway, this is the regular expression you will need.

Globally:
Find (?m)((^[^\S\r\n]*?(?=\S)(?:(?!Niveau|stime).)+$)[\S\s]*?)^\2$(?:\r?\n)?
Replace $1

Repeat the global replace until there are no more matches (i.e. no more replacements are made).

Explanation:

  (?m)                          # Multi-line mode
  (                             # (1 start), To be written back
       (                             # (2 start), The line to test
            ^                             # BOL begin of line
            [^\S\r\n]*?                   # Spurious horizontal whitespace
            (?= \S )                      # Must be a non-whitespace ahead
            (?:                           # Skip lines containing these
                 (?! Niveau | stime )
                 .
            )+
            $                             # EOL end of line
       )                             # (2 end)
       [\S\s]*?                      # Anything up to the duplicate
  )                             # (1 end)
  ^ \2 $                        # The actual duplicate line
  (?: \r? \n )?                 # Optional linebreak (if last line, then ok)

Note that, as the regular expression stands now, horizontal whitespace at BOL and EOL is not trimmed, so duplicate lines must match exactly.
It is easy, however, to add the extra trimming if necessary.
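
The restart-until-clean loop can also be sketched outside the editor. The following is a minimal Python sketch, not the answer's own code: it reuses the Find pattern above unchanged, and the names `DUPLICATE` and `remove_duplicates` are hypothetical, chosen only for illustration.

```python
import re

# The "Find" pattern from the answer, unchanged. Group 1 is written back,
# group 2 is the line being tested for a later duplicate.
# DUPLICATE and remove_duplicates are illustrative names, not from the answer.
DUPLICATE = re.compile(
    r"(?m)((^[^\S\r\n]*?(?=\S)(?:(?!Niveau|stime).)+$)[\S\s]*?)^\2$(?:\r?\n)?"
)


def remove_duplicates(text: str) -> str:
    """Repeat the global replace until a pass makes no replacement."""
    while True:
        # subn returns the new text and the number of replacements made.
        text, count = DUPLICATE.subn(r"\1", text)
        if count == 0:
            return text


if __name__ == "__main__":
    sample = "a\nb\nNiveau 1\na\nb\nNiveau 1\nc\n"
    print(remove_duplicates(sample), end="")
```

Each pass removes at most one duplicate per non-overlapping match, which is why the loop has to start again from the beginning of the text.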


Update

A faster version of the above regular expression uses the \K construct to simplify the replacement.

Globally:

Find (?m)(^[^\S\r\n]*?(?=\S)(?:(?!Niveau|stime).)+$)[\S\s]*?\K^\1$(?:\r?\n)?
Replace '' (nothing)

Explanation

  (?m)                          # Multi-line mode
  (                             # (1 start), The line to test
       ^                             # BOL begin of line
       [^\S\r\n]*?                   # Spurious horizontal whitespace
       (?= \S )                      # Must be a non-whitespace ahead
       (?:                           # Skip lines containing these
            (?! Niveau | stime )
            .
       )+
       $                             # EOL end of line
  )                             # (1 end)
  [\S\s]*?                      # Anything up to the duplicate
  \K                            # Disregard the match up to here
  ^ \1 $                        # The actual duplicate line to be deleted
  (?: \r? \n )?                 # Optional linebreak (if last line, then ok)
+3

The regular expression below works, BUT you have to click the Replace button once for every duplicate line. For example, the OP's sample contains 4 lines that need removing, so you have to click Replace 4 times. I understand that this may not be an efficient solution for large files, but it is my best attempt at solving this problem.

 ^(?!(?:\s*$|.*(?:Niveau|stime)))(.*$)([\s\S]*?)(\1\s*) 

Replace each match with \1\2

Here is a demo that illustrates replacing only the first repeated line. You need to repeat this replacement several times to remove every repetition while keeping the first occurrence of each repeated line.

Regex Explanation:

  • ^ - anchors the match at the beginning of a line
  • (?!(?:\s*$|.*(?:Niveau|stime))) - a negative lookahead that makes sure the line is not empty and does not contain the words Niveau or stime
  • (.*$) - matches the contents of a line and captures it in group 1. Group 1 holds a line that may be repeated somewhere later in the file.
  • ([\s\S]*?) - matches 0+ occurrences of any character, as few as possible, and captures them in group 2
  • (\1\s*) - matches the contents of group 1 followed by 0+ whitespace characters and captures them in group 3. Group 3 is what we discard from the file, since it is just a repetition of the line captured in group 1.
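
The repeated Replace clicks can be imitated in Python with one substitution per loop iteration. This is only a sketch under assumptions: `PATTERN`, `replace_until_done` and the sample text are made up for illustration, and because the pattern does not anchor the back-reference `\1` to a whole line, the sample deliberately avoids lines that occur inside other lines.

```python
import re

# The answer's pattern; re.MULTILINE makes ^ and $ match at line boundaries.
# PATTERN and replace_until_done are illustrative names, not from the answer.
PATTERN = re.compile(
    r"^(?!(?:\s*$|.*(?:Niveau|stime)))(.*$)([\s\S]*?)(\1\s*)",
    re.MULTILINE,
)


def replace_until_done(text):
    """Apply one replacement at a time; return the text and the click count."""
    clicks = 0
    while True:
        # count=1 mimics a single click on the Replace button.
        text, count = PATTERN.subn(r"\1\2", text, count=1)
        if count == 0:
            return text, clicks
        clicks += 1


if __name__ == "__main__":
    sample = "Time: | 0\nNP | 0\nNiveau 1\nTime: | 0\nNP | 0\nNiveau 2\n"
    cleaned, clicks = replace_until_done(sample)
    print(clicks)
    print(cleaned, end="")
```

The returned click count equals the number of duplicate lines removed, matching the one-click-per-duplicate behavior described above.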

This is easier to follow step by step.

Before any replacement is performed, the file contains four duplicate lines that need to be deleted: A, B, C and D. Since there are 4 such lines, we have to click Replace 4 times:

  • After the first click, line A is deleted and only B, C and D remain.
  • After the second click, line B is also deleted, and only C and D remain.
  • After the third click, line C is also deleted, and only D remains.
  • After the fourth click, line D is also deleted, and no duplicate lines remain.

+2

Using awk:

  awk '(a[$0]++==0)||(/Niveau|stime/)' file

  • (a[$0]++==0) - a[$0] is an entry in the associative array a, keyed by the whole line; ++ increments it by 1 (an uninitialized entry counts as 0); ==0 checks whether this is the first time the line $0 has been seen (the value is incremented after the comparison)

  • (/Niveau|stime/) - the line contains one of the words "Niveau" or "stime"

  • || - if either of the two conditions is true for a line, the line is printed
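
For comparison, the same single-pass idea can be sketched in Python. This is an illustrative equivalent, not part of the answer: the function name `dedupe_lines` and the `keep_words` parameter are hypothetical, and a set plays the role of the awk array.

```python
def dedupe_lines(lines, keep_words=("Niveau", "stime")):
    """Keep the first occurrence of each line, in the original order.

    Lines containing any of keep_words are always kept, like the
    /Niveau|stime/ condition in the awk command.
    """
    seen = set()   # lines encountered so far (the role of a[$0] in awk)
    kept = []
    for line in lines:
        exempt = any(word in line for word in keep_words)
        if exempt or line not in seen:
            kept.append(line)
        seen.add(line)
    return kept


if __name__ == "__main__":
    import sys

    # Usage (assumed): python dedupe.py < file
    for line in dedupe_lines(sys.stdin.read().splitlines()):
        print(line)
```

Unlike the regex approaches, this needs only one pass over the file and no sorting, at the cost of keeping every distinct line in memory.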

+1

Source: https://habr.com/ru/post/1275286/