How to remove part of a string in a multi-line fragment using sed or Perl?

Question

I have data that looks like this. Each fragment has four lines and begins with the @ symbol.

 @SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 AAAAAAAAAAAAAAAAAAAAAAAAAAA
 +SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 ::::::::::::::::::::::::;;8
 @SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 TATAACCAGAAAGTTACAAGTAAACAC
 +SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 888888888888888888888888888

In the third line of each fragment, I want to delete the text that appears after the + symbol, resulting in:

 @SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 AAAAAAAAAAAAAAAAAAAAAAAAAAA
 +
 ::::::::::::::::::::::::;;8
 @SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 TATAACCAGAAAGTTACAAGTAAACAC
 +
 888888888888888888888888888

Is there a compact way to do this in sed or Perl?

+4

linux unix perl sed

neversaint Jan 27 '11 at 6:21

5 answers

Assuming you don't want to blindly delete the rest of every line that starts with +, you can do this:

 sed '/^@/{N;N;s/\n+.*/\n+/}' infile

Output

 $ sed '/^@/{N;N;s/\n+.*/\n+/}' infile
 @SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 AAAAAAAAAAAAAAAAAAAAAAAAAAA
 +
 ::::::::::::::::::::::::;;8
 @SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 TATAACCAGAAAGTTACAAGTAAACAC
 +
 888888888888888888888888888
 +Dont remove me

* Note: although the above command keys on @ to decide whether to alter the line that starts with +, it will still alter the second line if that line also starts with +. That doesn't sound like something that can happen here, but if you want to rule out this corner case, the following minor change guards against it:

 sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' infile

Output

 $ sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' ./infile
 @SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
 +AAAAAAAAAAAAAAAAAAAAAAAAAAA
 +
 ::::::::::::::::::::::::;;8
 @SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
 TATAACCAGAAAGTTACAAGTAAACAC
 +
 888888888888888888888888888
 +Dont remove me
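The difference between the two commands is easiest to see on a fragment whose second line starts with +. The following is a minimal sketch, not part of the original answer: the sample record is made up for illustration, and it assumes GNU sed (which treats \n in the replacement as a newline).

```shell
# Hypothetical sample: one fragment whose second line starts with '+'.
tmp=$(mktemp)
printf '%s\n' '@id1 length=4' '+ACG' '+id1 length=4' 'IIII' > "$tmp"

# First form: the match starts at the FIRST "\n+", and because "." also
# matches the newlines held in the pattern space, lines 2 and 3 collapse
# into a single "+".
first=$(sed '/^@/{N;N;s/\n+.*/\n+/}' "$tmp")

# Second form: the greedy \(.*\) pushes the match to the LAST "\n+",
# so only the third line is trimmed.
second=$(sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' "$tmp")

printf 'first form:\n%s\n\nsecond form:\n%s\n' "$first" "$second"
rm -f "$tmp"
```

With this input the first form prints three lines (the +ACG line is lost), while the second form keeps all four.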

+4

Siegex Jan 27 '11 at 6:27

If you can use awk, you can do:

  gawk '{if ($0 ~ /^@/ ) { print ; getline ; print ; getline ; print "+"}}' INPUTFILE

So if gawk sees @ at the beginning of a line, it prints that line, then reads and prints the next line, and finally reads the 3rd line (counting from the @ line), discards it, and prints only +.

If the + is not at the beginning of the line, you can use gensub(/\+.*/, "+", 1) instead of "+" in the last print.

(And if you have perl installed, there will most likely be an a2p as well, which can convert the above awk script to Perl if you want...)

HTH

UPDATE (4th line missing):

  gawk '{if ($0 ~ /^@/ ) { print ; getline ; print ; getline ; print "+"; getline; print }}' INPUTFILE

This also prints the 4th line.
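One thing the getline chain does not do is check whether a line was actually read. As a hedged variant (my own sketch, not the answerer's code), each getline can be tested so that a truncated final fragment does not print stray lines. The function name strip_plus and the sample data are hypothetical, and any POSIX awk should do.

```shell
# strip_plus: same flow as the gawk one-liner, but every getline is checked.
# Reads the named files, or standard input when no file is given.
strip_plus() {
    awk '/^@/ {
        print                          # header line
        if ((getline) > 0) print       # sequence line
        if ((getline) > 0) print "+"   # separator line, reduced to "+"
        if ((getline) > 0) print       # quality line
    }' "$@"
}

tmp=$(mktemp)
printf '%s\n' '@id1 length=4' 'ACGT' '+id1 length=4' 'IIII' > "$tmp"
out=$(strip_plus "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Like the original, this only prints lines that belong to a fragment starting with @; anything outside such a fragment is dropped.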

+1

Zsolt Botykai Jan 27 '11 at 8:16

Maybe just sed '/^@/+2 s/+.*/+/'

Edit: this does not work in sed, but the same idea should work as a Vim command:

 vim file -c ':g/^@/+2s/+.*/+/' -c 'wq'

0

Benoit Jan 27 '11 at 6:26

This might work for you:

 sed '/^@/{$!N;$!N;$!N;s/\n+[^\n]*/\n+/g}' file

or using GNU sed:

 sed '/^@/,+3s/^+.*/+/' file

0

potong Mar 18 '12 at 9:08

ysth · Accepted Answer · 2011-01-27T06:29:01+0000

If there is never a + on the first or second line, and always one on the third:

 perl -0100pi -e's/\+.*/+/' datafile

Otherwise:

 perl -0100pi -e's/^((?:.*\n){2}.*?\+).*/$1/' datafile

or, with Perl 5.10+:

 perl -0100pi -e's/^(?:.*\n){2}.*?\+\K.*//' datafile

All of these assume that @ appears only at the beginning of a fragment. If it can appear elsewhere, then:

 perl -pi -e's/\+.*/+/ if $. % 4 == 3' datafile
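To illustrate why counting lines is the more robust choice, here is a sketch with hypothetical data in which the fourth line of the first fragment happens to start with @. It compares the @-anchored sed command from an earlier answer with the line-counting Perl command, again without -i, and assumes GNU sed.

```shell
# Hypothetical sample: the first fragment's fourth line is "@@@@".
tmp=$(mktemp)
printf '%s\n' \
    '@id1 length=4' 'ACGT' '+id1 length=4' '@@@@' \
    '@id2 length=4' 'TTTT' '+id2 length=4' 'IIII' > "$tmp"

# Anchoring on "@": the line "@@@@" is mistaken for the start of a fragment,
# so the second fragment's "+" line is never reached by the substitution.
by_marker=$(sed '/^@/{N;N;s/\n+.*/\n+/}' "$tmp")

# Counting lines: exactly the third line of every group of four is edited.
by_count=$(perl -pe 's/\+.*/+/ if $. % 4 == 3' "$tmp")

printf 'anchored on @:\n%s\n\ncounting lines:\n%s\n' "$by_marker" "$by_count"
rm -f "$tmp"
```

In the sed output the line +id2 length=4 survives untouched, while the Perl output has a bare + in both fragments. The line-counting form in turn assumes that every fragment is exactly four lines long.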
