Removing blocks of text from a huge text file

Question

I've been asked to do something rather painful, and I was wondering if anyone could help.

Our vendor provided us with an SNMP MIB file (a plain .txt file). Unfortunately, much of this file is deprecated and needs to be removed for our monitoring application.

I tried doing it manually, but the file is over 800,000 lines long, and it is sapping my will to live.

The structure looks something like this:

    -- /*********************************************************************************/
    -- /* MIB table for Hardware */
    -- /* Valid from: 543.44 */
    -- /* Deprecated from: 600.3 */
    -- /*********************************************************************************/

    Some text some text
    Some text

    -- /*********************************************************************************/
    -- /* MIB table for Hardware */
    -- /* Valid from: 543.44 */
    -- /*********************************************************************************/

    Some text some text
    Some text

    -- /*********************************************************************************/
    -- /* MIB table for Hardware */
    -- /* Valid from: 364.44 */
    -- /* Deprecated from: 594.3 */
    -- /*********************************************************************************/

This pattern repeats throughout the file, with deprecated blocks scattered at random.

I think what I need is a script that will:

find the text "Deprecated from", then

 delete that line, delete the preceding 3 lines, delete the following line, and then delete all following lines up to the next "-- /*********************************************************************************/"

Does that make sense? Is this possible, or am I just dreaming?

Thank you!


awk sed

Laptopgrrl Feb 01 '12 at 12:52


4 answers

Dan fego · Answer 1 · 2012-02-01T01:16:52+0000

Edit: I just realized that I misread your question, even after this answer had been upvoted several times. My answer was off before! It should be more correct now, but it relies on some additional assumptions. Simple solutions can only get you so far!

This may work for you, given a few assumptions:

 cat -s data | awk -vFS='\n' -vRS='\n\n' '/Deprecated from/ { getline; next } 1'

cat -s squeezes runs of blank lines down to a single one, which makes awk's job easier. As for awk, -vFS='\n' tells it that fields are separated by newlines, and -vRS='\n\n' says that records are separated by two newlines in a row (a multi-character RS is an extension found in gawk and mawk, not in every awk). Then /Deprecated from/ finds records that contain this text, and { getline; next } reads in the record after it (the text block) and moves on without printing either one. The trailing 1 is shorthand for printing every record that reaches that point.

The assumptions are as follows:

All comment blocks and text blocks are separated by at least one blank line on both sides.
Comment blocks and text blocks strictly alternate.
There are no blank lines inside text blocks.
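
These assumptions can be checked before running the command. The sketch below is not part of the original answer: it uses POSIX awk's paragraph mode (RS set to the empty string, so each blank-line-separated block becomes one record) and reports any place where two headers or two text blocks follow each other. The function name check_alternation and the rule that a header record starts with "-- /" are assumptions based on the sample in the question.

```shell
# Hypothetical helper: exits 0 if header and text blocks strictly alternate.
check_alternation() {
    awk 'BEGIN { RS = "" }                     # paragraph mode: one record per block
         {
             is_header = ($0 ~ /^-- \//)       # assumption: headers start with "-- /"
             if (NR > 1 && is_header == prev) {
                 printf "record %d breaks the alternation\n", NR
                 bad = 1
             }
             prev = is_header
         }
         END { exit bad + 0 }' "$1"
}
```

For example, check_alternation vendor.mib && echo "layout looks regular" (vendor.mib is a placeholder file name). A text block that contains a blank line shows up as two consecutive text records, so it is flagged as well.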

So this may not be perfect for you. If these assumptions hold, though, awk is a good choice for this job, as you can see: the script is tiny!

    $ cat -s data | awk -vFS='\n' -vRS='\n\n' '/Deprecated from/ { getline; next } 1'
    -- /*********************************************************************************/
    -- /* MIB table for Hardware */
    -- /* Valid from: 543.44 */
    -- /*********************************************************************************/
    Some text some text
    Some text

Also, as you can see, the blank lines between the remaining blocks are squeezed out. To keep them, you can change the command as follows:

    $ cat -s data | awk -vFS='\n' -vRS='\n\n' '/Deprecated from/ { getline; next } { printf "%s\n\n", $0 }'
    -- /*********************************************************************************/
    -- /* MIB table for Hardware */
    -- /* Valid from: 543.44 */
    -- /*********************************************************************************/

    Some text some text
    Some text
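
If the awk at hand does not support a multi-character RS, the same idea can probably be expressed with POSIX paragraph mode instead. This is a sketch rather than the answer's own command: setting RS to the empty string makes every blank-line-separated block a record (runs of blank lines count as one separator, so cat -s is not needed), and ORS puts a blank line back after each surviving block. The function name strip_deprecated_paragraphs is made up for illustration, and the assumptions about block layout still apply.

```shell
# Hypothetical helper: filters stdin or the named files, printing surviving blocks.
strip_deprecated_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }        # RS="" splits on blank lines; ORS restores them
         /Deprecated from/ { getline; next }    # drop the header and the text block after it
         { print }' "$@"
}
```

It would be used as strip_deprecated_paragraphs data > data.clean, where data.clean is a placeholder output name.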

potong · Answer 2 · 2012-02-01T14:32:11+0000

This might work for you:

  sed '$!N;$!N;:a;$q;N;/Deprecated from/!{P;s/^[^\n]*\n//;ba};$d;$!N;$d;s/.*//;:b;$d;N;/^\n-- \/\*\+\/$/!{s/.*//;bb};D' file

This is a bit simpler (but less efficient, as it takes two passes):

 awk '/Deprecated from/{a=NR-3;getline;next};a>0 && /^-- \/\*+\/$/{b=NR-1;print a "," b "d";a=b=0};END{if(a>0)print a ",$d"}' file | sed -f - file
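
To see what the two passes do, it can help to run them separately. In the sketch below the awk program is the one from the command above, only spread over several lines; the function names are invented, and writing the generated commands to a temporary file (instead of piping them into sed -f -) is a substitution made here in case the local sed does not read a script from standard input.

```shell
# Pass 1: print one sed delete command ("start,endd") per deprecated block.
# Hypothetical function name; the awk rules are those of the two-pass command.
deprecated_ranges() {
    awk '/Deprecated from/ { a = NR - 3; getline; next }
         a > 0 && /^-- \/\*+\/$/ { b = NR - 1; print a "," b "d"; a = b = 0 }
         END { if (a > 0) print a ",$d" }' "$1"
}

# Pass 2: save the generated commands to a temporary file and let sed apply them.
strip_deprecated_two_pass() {
    script=$(mktemp) || return 1
    deprecated_ranges "$1" > "$script"
    sed -f "$script" "$1"
    rm -f "$script"
}
```

For a file whose first six lines are a deprecated five-line header followed by one line of text, pass 1 prints 1,6d, and sed then deletes exactly those lines. Inspecting the generated commands before applying them is a cheap safety check on an 800,000-line file.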

wallyk · Answer 3 · 2012-02-01T01:21:54+0000

This is simple with a vim macro.

Open the file: $ vim filename
Press qa to start recording a macro into register a
Type /Deprecated from: and then press Enter (to search for the text)
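3k (to go up 3 lines, to the top of the header)
5dd (to delete this line and the next 4, which is the whole header)
d/^-- \/\*\*\* and then Enter (to delete the lines up to the next header)
(if a leftover line remains at this point) press dd (to delete the current line)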
Press q to stop recording the macro
Type 1000000@a (to run the macro a million times; it stops once the search no longer finds anything)

Mark wilkins · Answer 4 · 2012-02-01T01:27:44+0000

I very much agree with the comment about using a different scripting language to solve this problem. Ruby, Perl, or Python would probably be better. But for fun, here is an ugly awk script that does it. The patterns may need some work if they do not match your file exactly. It implements a simple state machine: it keeps track of whether it is inside a header and whether that header is deprecated. It stores the header lines in an array. When it reaches the end of the header, it prints the header (if it is not deprecated). When it is not in a header, it prints lines only if the preceding header was not deprecated.

    {
        if ( $0 ~ /-- \/\*+\// ) {
            # This matches one of the -- /*********...****/ lines
            if ( headercount > 0 ) {
                # this must be the closing line in the header
                if ( !deprecated ) {
                    for ( i = 0; i < headercount; i++ ) {
                        print headers[i]
                    }
                    # print closing line
                    print
                } # if not deprecated
                headercount = 0
            } else {
                # must be starting a new section
                headers[0] = $0
                headercount = 1
                deprecated = 0
            }
        } else {
            if ( headercount == 0 ) {
                # not in a header section - print if not deprecated
                if ( !deprecated ) {
                    print
                }
            } else {
                # in a header section - track if it is a deprecated section
                if ( $0 ~ /Deprecated from/ ) {
                    deprecated = 1
                }
                # store the header info to dump when we hit the end
                headers[headercount++] = $0;
            }
        }
    }
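
Whichever script is used, it seems worth confirming the result on a file this large. The following is a small sketch, not part of the answer: it counts header blocks and deprecated markers so that the numbers can be compared before and after cleaning. The function name mib_block_counts is invented, and it assumes every header contains a "MIB table" line as in the sample from the question.

```shell
# Hypothetical helper: summarises how many header blocks a file holds.
mib_block_counts() {
    awk '/^-- \/\* MIB table/ { total++ }         # assumption: every header has this line
         /Deprecated from/    { deprecated++ }
         END { printf "%d blocks, %d deprecated\n", total, deprecated }' "$1"
}
```

After cleaning, the deprecated count should be 0 and the block count should have dropped by the number of deprecated blocks reported for the original file.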
