Print only the fields whose header matches a pattern, starting after the matched header

I am trying to extract specific fields from my file. The output should contain only the fields whose header matches an expression, and it should start with the data rows that follow the matched header.

This is an example of my input. Sometimes the fields are in a different order, and the number of lines before the header I'm trying to match also varies.

I found it difficult to do this with cut and sed, and I could not work out how to do it with awk.

 CGATS.17
 FORMAT_VERSION 1
 KEYWORD "SampleID"
 KEYWORD "SAMPLE_NAME"
 NUMBER_OF_FIELDS 45
 WEIGHTING_FUNCTION "ILLUMINANT, D50"
 WEIGHTING_FUNCTION "OBSERVER, 2 degree"
 BEGIN_DATA_FORMAT
 SampleID SAMPLE_NAME CMYK_C CMYK_M CMYK_Y CMYK_K LAB_L LAB_A LAB_B nm380 nm390 nm400
 END_DATA_FORMAT
 NUMBER_OF_SETS 182
 BEGIN_DATA
 1 1 40 40 40 0 62.5 6.98 4.09 0.195213 0.205916 0.212827
 2 2 0 40 40 0 73.69 25.48 24.89 0.200109 0.211081 0.218222
 3 3 40 40 0 0 63.95 12.14 -20.91 0.346069 0.365042 0.377148
 4 4 0 70 70 0 58.91 47.69 35.54 0.080033 0.084421 0.087317
 END_DATA

This is the quick-and-dirty code I used. It basically did the job, but it picks the fields by fixed position instead of searching for the field header. The awk command is there only to remove the empty lines around the output.

 cut -f 7-9 -s input.txt | sed -E 's/(LAB_.)//g' | awk 'NF' > file.txt 

The result I expect is as follows. It is still tab-delimited and contains only the values of the fields whose header starts with LAB_.

 62.5 6.98 4.09
 73.69 25.48 24.89
 63.95 12.14 -20.91
 58.91 47.69 35.54
3 answers

Script:

 #!/usr/bin/awk -f

 # Find the line starting with BEGIN_DATA_FORMAT, read the next line with
 # getline, and record the positions of the fields that have "LAB" in their name.
 /^BEGIN_DATA_FORMAT/ {
     getline
     for (i = 1; i <= NF; i++)
         if ($i ~ /LAB/)
             a[i] = $i
 }

 # Within the BEGIN_DATA .. END_DATA range, skip lines with fewer than two
 # fields (the range markers themselves). For every other line, walk the fields
 # in order and print those whose position was recorded above, tab-separated.
 # Walking by index keeps the column order, which "for (j in a)" does not guarantee.
 /^BEGIN_DATA$/,/^END_DATA$/ {
     if (NF < 2)
         next
     s = sep = ""
     for (j = 1; j <= NF; j++)
         if (j in a) {
             s = s sep $j
             sep = "\t"
         }
     print s
 }

Test:

 [jaypal:~/Temp] ./script.awk file
 62.5 6.98 4.09
 73.69 25.48 24.89
 63.95 12.14 -20.91
 58.91 47.69 35.54

Another awk script:

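  awk '
    /^BEGIN_DATA_FORMAT/ {
        getline; f = NF
        for (i = 1; i <= NF; i++) if ($i ~ /^LAB_[LAB]/) l[i]++
    }
    /^BEGIN_DATA/,/^END_DATA/ {
        if (NF != f) next    # skip the range markers, keep full data rows
        s = sep = ""
        for (x = 1; x <= NF; x++) if (x in l) { s = s sep $x; sep = "\t" }
        print s
    }' input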

The output for your input example:

 62.5 6.98 4.09
 73.69 25.48 24.89
 63.95 12.14 -20.91
 58.91 47.69 35.54

A few notes on the awk script above:

  • The header handling is similar to @JayPal's solution, with a slight difference. You mentioned that the order of the columns may vary, so to match the headers my awk script looks at the line after "BEGIN_DATA_FORMAT", since the first column of the header may be something other than SampleID.

  • The output, as you expected, contains only the tab-separated values and no header. Since you said the column order may vary, you lose the header information this way: which column is LAB_L, which is LAB_A, and so on. Printing the header as well can easily be added if it is really needed.
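
The header-preserving variant mentioned in the last note could be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name `lab_with_header` and the array name `keep` are hypothetical, and the sketch assumes whitespace-separated fields with the field names on the line right after `BEGIN_DATA_FORMAT`, as in the sample input. Columns are walked by index so that the output keeps the column order of the file.

```shell
# Sketch only: lab_with_header is a hypothetical helper name.
# Assumes the field names sit on the line right after BEGIN_DATA_FORMAT.
lab_with_header() {
    awk '
    # Join the fields whose positions were recorded in keep[] with tabs.
    function pick(   i, s, sep) {
        s = ""; sep = ""
        for (i = 1; i <= NF; i++)
            if (i in keep) { s = s sep $i; sep = "\t" }
        return s
    }
    $1 == "BEGIN_DATA_FORMAT" {
        if ((getline) <= 0) exit      # no header line follows
        nf = NF                       # expected width of a data row
        for (i = 1; i <= NF; i++)
            if ($i ~ /^LAB_/) keep[i] = 1
        print pick()                  # header row: the matching names
        next
    }
    $1 == "BEGIN_DATA" { in_data = 1; next }
    $1 == "END_DATA"   { in_data = 0; next }
    in_data && NF == nf { print pick() }
    ' "$1"
}
```

Calling `lab_with_header input.txt` on the sample input should print a `LAB_L LAB_A LAB_B` header row (tab-separated) followed by the four rows of values.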


This might work for you:

  sed '/^BEGIN_DATA\>/,/^END_DATA\>/{//d;s/\(\S*\s*\)\{6\}\(\S*\s*\S*\s*\S*\).*/\2/p};d' file 

Or, staying with cut:

 cut -f7-9 file | sed '/^\([-.0-9]*\s*[-.0-9]*\s*[-.0-9.]*$\)/!d' 

Or (though here I am guessing at the format of your input file):

 sed 's/\s*$//' file | cut -f7-9 | sed '/^BEGIN_DATA$/,/^END_DATA$/{//d;p};d' 

Source: https://habr.com/ru/post/1392169/

