How to remove part of a string in a multi-line fragment using sed or Perl?

I have some data that looks like this. It comes in fragments of four lines, and each fragment begins with the @ symbol.

 @SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 AAAAAAAAAAAAAAAAAAAAAAAAAAA
 +SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 ::::::::::::::::::::::::;;8
 @SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 TATAACCAGAAAGTTACAAGTAAACAC
 +SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 888888888888888888888888888

In the third line of each fragment, I want to delete the text that appears after the + symbol, resulting in:

 @SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 AAAAAAAAAAAAAAAAAAAAAAAAAAA
 +
 ::::::::::::::::::::::::;;8
 @SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 TATAACCAGAAAGTTACAAGTAAACAC
 +
 888888888888888888888888888

Is there a compact way to do this in sed or Perl?

+4
5 answers

If the first two lines of a fragment never contain a +, and the third line always contains one:

 perl -0100pi -e's/\+.*/+/' datafile 

Otherwise:

 perl -0100pi -e's/^((?:.*\n){2}.*?\+).*/$1/' datafile 

or, with Perl 5.10 or later:

 perl -0100pi -e's/^(?:.*\n){2}.*?\+\K.*//' datafile 

All of these assume that @ appears only at the beginning of a fragment. If it can appear elsewhere, then:

 perl -pi -e's/\+.*/+/ if $. % 4 == 3' datafile 
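
The line-number idea can also be tried without editing a file in place. The sketch below expresses the same "third line of every four" test in awk instead of Perl and runs it on a small made-up sample whose last quality line starts with @ on purpose. The function name strip_plus_headers and the sample reads are illustrative assumptions, not part of the original data.

```shell
#!/bin/sh
# Sketch: truncate the "+" line of each 4-line fragment purely by position.
# strip_plus_headers is a hypothetical helper name used for illustration.
strip_plus_headers() {
    # On line 3 of every 4-line fragment, drop everything after the first "+".
    awk 'NR % 4 == 3 { sub(/\+.*/, "+") } { print }' "$@"
}

# Made-up sample: the second quality line begins with "@" and contains "+".
sample='@read1 length=4
ACGT
+read1 length=4
::;8
@read2 length=4
TTGA
+read2 length=4
@+88'

result=$(printf '%s\n' "$sample" | strip_plus_headers)
printf '%s\n' "$result"
```

Because only the line position is tested, the final line is left alone even though it begins with @ and contains a +. The trade-off is that the input must consist of strictly four-line fragments with no blank or extra lines.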
+3

Assuming you don't want to blindly truncate every line that starts with +, you can do this:

 sed '/^@/{N;N;s/\n+.*/\n+/}' infile 

Output

 $ sed '/^@/{N;N;s/\n+.*/\n+/}' infile
 @SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 AAAAAAAAAAAAAAAAAAAAAAAAAAA
 +
 ::::::::::::::::::::::::;;8
 @SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 TATAACCAGAAAGTTACAAGTAAACAC
 +
 888888888888888888888888888
 +Dont remove me

* Note: although the command above keys on @ to decide whether to modify the line starting with +, it will still modify the second line if that line also starts with +. That doesn't sound like your case, but if you want to rule out this corner case, the following small change protects against it:

 sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' infile 

Output

 $ sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' ./infile
 @SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 +AAAAAAAAAAAAAAAAAAAAAAAAAAA
 +
 ::::::::::::::::::::::::;;8
 @SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 TATAACCAGAAAGTTACAAGTAAACAC
 +
 888888888888888888888888888
 +Dont remove me
+4

If you can use awk, you can do:

  gawk '{if ($0 ~ /^@/ ) { print ; getline ; print ; getline ; print "+"}}' INPUTFILE 

So, if gawk sees @ at the beginning of a line, it prints that line, then reads and prints the next line, and finally reads the 3rd line (counting from the @ line) and prints only a + in its place.

If the + is not at the beginning of the line, you can use gensub(/\+.*/, "+", 1) instead of "+" in the last print.

(And if you have Perl installed, you will most likely also have a2p, which can convert the awk script above to Perl if you want.)

HTH

UPDATE (4th line missing):

  gawk '{if ($0 ~ /^@/ ) { print ; getline ; print ; getline ; print "+"; getline; print }}' INPUTFILE 

This version also prints the 4th line.
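
As a variation on the command above, here is a minimal sketch that checks the return value of each getline, so a file that ends in the middle of a fragment does not print its last line twice, and that uses sub() so a + that is not at the start of the line is handled too. The function name trim_fragments and the sample input are assumptions made for illustration. Like the original command, it drops any line that is not part of a fragment beginning with @.

```shell
#!/bin/sh
# Sketch: the getline approach with guarded reads.
# trim_fragments is a hypothetical helper name used for illustration.
trim_fragments() {
    awk '
    /^@/ {
        print                          # line 1: fragment header
        if ((getline) > 0) print       # line 2: sequence
        if ((getline) > 0) {           # line 3: keep text up to the first "+"
            sub(/\+.*/, "+")
            print
        }
        if ((getline) > 0) print       # line 4: quality string, printed unchanged
    }' "$@"
}

# Made-up sample: one complete fragment, then a fragment cut off after two lines.
sample='@r1
ACGT
+r1 extra
;;;;
@r2
TT'

result=$(printf '%s\n' "$sample" | trim_fragments)
printf '%s\n' "$result"
```

Since the quality line is consumed by getline inside the block, it is never tested against /^@/, so a quality string that happens to start with @ is not mistaken for a new fragment.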

+1

Maybe just sed '/^@/+2 s/+.*/+/'

Edit: this does not work in sed, but the same idea should work as a vim command:

 vim file -c ':g/^@/+2s/+.*/+/' -c 'wq' 
0

This might work for you:

 sed '/^@/{$!N;$!N;$!N;s/\n+[^\n]*/\n+/g}' file 

or using GNU sed:

 sed '/^@/,+3s/^+.*/+/' file 
0

Source: https://habr.com/ru/post/1337265/

