Read a large file and print the sections that match several criteria

I rarely have to deal with scripting, so I am up against my lack of knowledge on this problem.

I have a text file larger than 500 MB that is cleanly divided into sections, but I know it contains 5-10 "bad" sections. The data in a section is easy for a person to evaluate; I just do not know how to do it in a program.

I capture a known-good value in #FIELD MyField. If that value does not also appear in #FIELD LOCATION, something went wrong.

An example of two sections within a file is as follows. The first is bad and the second is good.

    #START Descriptor
    #FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END
    #START Descriptor
    #FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END

  • Sections begin and end logically, with #START and #END

  • If #FIELD LOCATION does not exist, continue to the next section.

  • If #FIELD MyField="BAR" and #FIELD LOCATION does not contain BAR , print all the lines from this section into a new file.

  • Note: #FIELD MyField="BAR" is a control value that I insert from other information captured about the data when the file is created (in my case it is a language indicator such as EN or DE, so it will literally be #FIELD MyField="EN"). Any other value in this field should be ignored; such a record does not match my criteria.

I believe this can be done in Awk or Perl. I can write very simple one-liners, but this is beyond my skills.

4 answers

You can do something like the code below. It is only a rough draft, but it works with your sample data. It uses the flip-flop (range) operator to find the start and end of each record, a hash to store the field values, and an array to store the lines of the record.

I only check whether the value appears anywhere in the LOCATION string. You may want to narrow the check further by making sure it is in the right position or has the right case.

    use strict;
    use warnings;

    my @record;   # lines of the current section
    my %f;        # field name => value for the current section

    while (<DATA>) {
        if (/^#START / .. /^#END */) {
            if (/^#START /) {
                @record = ();   # reset
                %f = ();
            }
            push @record, $_;
            if (/^#END */) {   # check and print
                # Skip sections that lack either field; match the value literally.
                if (defined $f{'LOCATION'} and defined $f{'MyField'}
                        and $f{'LOCATION'} !~ /\Q$f{'MyField'}\E/) {
                    print @record;
                }
            } else {   # add fields to hash
                if (/^#FIELD (.+)/) {
                    # use split with a limit of 2 fields
                    my ($key, $val) = split /=/, $1, 2;
                    next unless defined $val and length $val;   # no empty values
                    $val =~ s/^"|"$//g;                         # strip quotes
                    $f{$key} = $val;
                }
            }
        }
    }

    __DATA__
    #START Descriptor
    #FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END
    #START Descriptor
    #FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END

A one-liner:

 perl -ne 'BEGIN { $/ = "#END\n" }' -e '/MyField="(.*?)"/; print if !/Value=$1/' <file >newfile 

This sets the input record separator to "#END\n", so perl reads one whole section ("chunk") into $_ at a time. It then captures the value of MyField and prints the entire section if Value=$1 (that is, the captured value following "Value=") is absent.

You can, of course, make regular expressions more specific if necessary.


Here is a small gawk one-liner for you:

    gawk '
    {
        if ($2 !~ /^#FIELD LOCATION/) {
            next;
        } else {
            split($2, ary, "=|&");
            split($4, ary1, "=|\"");
            if (ary[4] != ary1[3]) {
                print $0 > "badrec.file"
            }
        }
    }' RS="#END\n" ORS="#END\n" FS="\n" file

Input file:

    [jaypal:~/Temp] cat file
    #START Descriptor # Good Record
    #FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END
    #START Descriptor # Bad Record
    #FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END
    #START Descriptor # Good Record
    #FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END

Test:

    [jaypal:~/Temp] gawk '
    {
        if ($2 !~ /^#FIELD LOCATION/) {
            next;
        } else {
            split($2, ary, "=|&");
            split($4, ary1, "=|\"");
            if (ary[4] != ary1[3]) {
                print $0 > "badrec.file"
            }
        }
    }' RS="#END\n" ORS="#END\n" FS="\n" file

    [jaypal:~/Temp] cat badrec.file
    #START Descriptor # Bad Record
    #FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END
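The gawk version above relies on LOCATION being the second line of a section and MyField the fourth. If the field order can vary, a line-by-line variant along the following lines might be more forgiving. It is only a sketch in plain awk: the file names records.txt and badrec.txt are made up, and it still assumes the value is carried in a parameter named Value.

```shell
# Sample with MyField placed before LOCATION in the bad record (hypothetical file).
cat > records.txt <<'EOF'
#START Descriptor # Good Record
#FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
#FIELD AnythingElse
#FIELD MyField="BAR"
#END
#START Descriptor # Bad Record
#FIELD MyField="BAR"
#FIELD AnythingElse
#FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
#END
EOF

awk '
/^#START/ { buf = ""; loc = ""; val = "" }    # new section: reset state
{ buf = buf $0 "\n" }                         # collect every line of the section
/^#FIELD LOCATION=/ { loc = $0 }              # remember the LOCATION line
/^#FIELD MyField="/ {                         # extract the quoted MyField value
    val = $0
    sub(/^#FIELD MyField="/, "", val)
    sub(/"$/, "", val)
}
/^#END/ {
    # Print the section when both fields exist but LOCATION lacks Value=<MyField>.
    if (loc != "" && val != "" && index(loc, "Value=" val) == 0)
        printf "%s", buf
}
' records.txt > badrec.txt
```

Note that index() does a plain substring test, so Value=BARX would also count as containing Value=BAR; tighten that if your values can share prefixes.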

Set the input record separator to "#END\n" and read a whole record at a time:

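    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = "#END\n";   # read one whole section per iteration

    while (<DATA>) {
        # Skip sections that have no LOCATION field.
        next unless /^#FIELD LOCATION/m;
        # Skip sections without MyField so a stale $1 is never reused.
        next unless /^#FIELD MyField="(.*)"$/m;
        my $val = $1;
        # Section is good if LOCATION contains the value (matched literally).
        next if /^#FIELD LOCATION.*\Q$val\E/m;
        print;
    }

    __DATA__
    #START Descriptor
    #FIELD LOCATION="http://path.to/file/here&Value=FOO&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END
    #START Descriptor
    #FIELD LOCATION="http://path.to/file/here&Value=BAR&OtherValue=BLAH"
    #FIELD AnythingElse
    #FIELD MyField="BAR"
    #END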

Source: https://habr.com/ru/post/906697/

