Find rows where a column has a specific value and replace that column on the next row with gawk

I am trying to find all the places where my data has a repeated row and delete the duplicate. I am also looking for the rows where the second column has a value of 90, so that I can replace the second column of the following row with a specific number.

My data is as follows:

     #  Type  Response  Acc  RT      Offset
     1  70    0         0    0.0000  57850
     2  31    0         0    0.0000  59371
     3  41    0         0    0.0000  60909
     4  70    0         0    0.0000  61478
     5  31    0         0    0.0000  62999
     6  41    0         0    0.0000  64537
     7  41    0         0    0.0000  64537
     8  70    0         0    0.0000  65106
     9  11    0         0    0.0000  66627
    10  21    0         0    0.0000  68165
    11  90    0         0    0.0000  68700
    12  31    0         0    0.0000  70221

I want my data to look like this:

     #  Type  Response  Acc  RT      Offset
     1  70    0         0    0.0000  57850
     2  31    0         0    0.0000  59371
     3  41    0         0    0.0000  60909
     4  70    0         0    0.0000  61478
     5  31    0         0    0.0000  62999
     6  41    0         0    0.0000  64537
     8  70    0         0    0.0000  65106
     9  11    0         0    0.0000  66627
    10  21    0         0    0.0000  68165
    11  90    0         0    0.0000  68700
    12  5     0         0    0.0000  70221

My code is:

    BEGIN {
        priorline = ""
        ERROROFFSET = 50
        ERRORVALUE[10] = 1; ERRORVALUE[11] = 2; ERRORVALUE[12] = 3
        ERRORVALUE[30] = 4; ERRORVALUE[31] = 5; ERRORVALUE[32] = 6
        ORS = "\n"
    }
    NR == 1 {
        print
        getline
        priorline = $0
    }
    NF == 6 {
        brandnewline = $0
        mytype = $2
        $0 = priorline
        priorField2 = $2
        if (mytype !~ priorField2) {
            print
            priorline = brandnewline
        }
        if (priorField2 == "90") {
            mytype = ERRORVALUE[mytype]
        }
    }
    END { print brandnewline }

Here brandnewline is set to the current line, priorline is then set to the line we just worked on, and brandnewline becomes the next new line we work on (i.e. line 1 = brandnewline; we then set priorline = brandnewline, so priorline holds line 1 and brandnewline takes on line 2). The same is done with column 2: mytype is the current column-2 value, and priorField2 holds the value mytype had for the previous line. Then there is an if statement: if the value in column 2 of the current line does not match (!~) the value in column 2 of the previous line, the current line is printed; otherwise it is skipped. The second if statement recognizes the lines where the value 90 appears and replaces the value in column 2 with the previously defined ERRORVALUE for each specific type (type 10=1, 11=2, 12=3, 30=4, 31=5, 32=6).

I was able to successfully remove the duplicate lines, but I cannot get the next part of my code to work, which is to replace the second-column value on the line following a 90 with the ERRORVALUE I designated in BEGIN (10 = 1, 11 = 2, 12 = 3, 30 = 4, 31 = 5, 32 = 6). Essentially, I just want to substitute my ERRORVALUE for that value in the line.

If anyone can help me, I would be very grateful.

+4
5 answers

One problem is that you cannot just compare one line with the previous one, as the ID number will be different.

    awk '
    BEGIN {
        ERRORVALUE[10] = 1
        # ... etc
    }
    # print the header
    NR == 1 { print; next }
    NR == 2 || $0 !~ prev_regex {
        prev_regex = sprintf("^\\s+\\w+\\s+%s\\s+%s\\s+%s\\s+%s\\s+%s", $2, $3, $4, $5, $6)
        if (was90) $2 = ERRORVALUE[$2]
        print
        was90 = ($2 == 90)
    }
    '

For the rows where the second column is altered, this messes up the whitespace alignment:

     #  Type  Response  Acc  RT      Offset
     1  70    0         0    0.0000  57850
     2  31    0         0    0.0000  59371
     3  41    0         0    0.0000  60909
     4  70    0         0    0.0000  61478
     5  31    0         0    0.0000  62999
     6  41    0         0    0.0000  64537
     8  70    0         0    0.0000  65106
     9  11    0         0    0.0000  66627
    10  21    0         0    0.0000  68165
    11  90    0         0    0.0000  68700
    12 5 0 0 0.0000 70221

If this is a problem, you can pipe the gawk output to column -t, or, if you know the line format is fixed, use printf() in the awk program.
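A minimal sketch of the column -t approach, assuming the program above has been saved to a file named dedup.awk and the data to data.txt (both file names are mine, not from the answer):

    awk -f dedup.awk data.txt | column -t

column -t re-splits each output line on whitespace and pads every column to a common width, which restores the alignment at the cost of the original spacing.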

+2

This might work for you:

    v=99999
    sed ':a;$!N;s/^\(\s*\S*\s*\)\(.*\)\s*\n.*\2/\1\2/;ta;s/^\(\s*\S*\s*\) 90 /\1'"$(printf "%5d" $v)"' /;P;D' file

     #  Type   Response  Acc  RT      Offset
     1  70     0         0    0.0000  57850
     2  31     0         0    0.0000  59371
     3  41     0         0    0.0000  60909
     4  70     0         0    0.0000  61478
     5  31     0         0    0.0000  62999
     6  41     0         0    0.0000  64537
     8  70     0         0    0.0000  65106
     9  11     0         0    0.0000  66627
    10  21     0         0    0.0000  68165
    11  99999  0         0    0.0000  68700
    12  31     0         0    0.0000  70221
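For readability, here is the same script spread over several lines with comments giving my reading of it (GNU sed allows # comments inside a script); this is only an annotated illustration of the one-liner above, not a different solution:

    v=99999
    sed '
      :a
      # append the next input line to the pattern space (unless on the last line)
      $!N
      # if everything after the first field of the appended line repeats everything
      # after the first field of the current line, merge the two into one copy,
      # i.e. drop the duplicate line
      s/^\(\s*\S*\s*\)\(.*\)\s*\n.*\2/\1\2/
      # if a duplicate was removed, loop back and try the following line as well
      ta
      # replace a second-column value of 90 with $v, printed 5 characters wide
      s/^\(\s*\S*\s*\) 90 /\1'"$(printf "%5d" $v)"' /
      # print up to the first newline, drop that part, and restart with the rest
      P
      D
    ' file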
+1

This might work for you:

    awk 'BEGIN {
        ERROROFFSET = 50
        ERRORVALUE[10] = 1; ERRORVALUE[11] = 2; ERRORVALUE[12] = 3
        ERRORVALUE[30] = 4; ERRORVALUE[31] = 5; ERRORVALUE[32] = 6
    }
    NR == 1 { print; next }
    {
        if (a[$2 $6]) { next } else { a[$2 $6]++ }
        if ($2 == 90) { print; n++; next }
        if (n > 0)    { $2 = ERRORVALUE[$2]; n = 0 }
        printf("% 4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6)
    }' INPUTFILE

See it in action on ideone.com.

The BEGIN block is, IMO, self-explanatory. Then the following happens:

  • The NR == 1 rule prints the very first line (the header) and moves on to the next input line; this rule fires only for that first line.
  • Then we check whether we have already seen a row with the same 2nd and 6th columns; if so, we skip to the next row, otherwise we mark it as seen in the array (using the concatenated column values as the index). Note that this can trip you up if you have a large value in the 2nd column and a small one in the 6th (e.g. 2 and 0020 concatenate to 20020, which is the same as 20 and 020), so you may want to put a separator in the index, something like a[$2 "-" $6] (see the sketch after this list), and you can include more columns to make the check even stricter.
  • If a line has 90 in the second column, it is printed as-is, a flag is set so that the replacement happens on the next line, and we skip to the next line of the input file.
  • On the following line, if that flag is set, the second column is looked up in ERRORVALUE and replaced with its mapped value, and the flag is cleared.
  • Finally, the line is printed with a fixed-width printf format.
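A minimal sketch of the separator idea from the second bullet; only the duplicate check is shown, the rest of the answer's logic is omitted, and the key layout and the print are my own illustration:

    {
        key = $2 "-" $6      # "2-0020" and "20-020" stay distinct, unlike plain concatenation
        if (key in a) next   # this Type/Offset pair has been seen before: skip the duplicate
        a[key] = 1           # remember the pair
        print                # otherwise pass the line through
    }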
+1

I agree with Glenn that two passes over the file are nicer. You can remove your repeated lines using a hash, for example:

awk '!a[$2,$3,$4,$5,$6]++' file.txt
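The !a[$2,$3,$4,$5,$6]++ idiom prints a line only the first time that combination of columns 2-6 is seen. A rough expansion, purely for illustration:

    {
        key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 SUBSEP $6   # what a[$2,$3,$4,$5,$6] indexes on
        if (a[key] == 0) print                             # first occurrence: print (the default action)
        a[key]++                                           # count it so later copies are skipped
    }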

Then you can change the values as you wish. If you want to change a value of 90 in the second column to 5000, try something like this:

awk 'NR == 1 { print; next } { sub(/^90$/, "5000", $2); printf("%4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6) }' file.txt

You can see that I stole the printf expression from Zsolt's answer (thanks Zsolt!) for the formatting, but you can tweak it if necessary. You can also pipe the output of the first command into the second for a nice one-liner:

cat file.txt | awk '!a[$2,$3,$4,$5,$6]++' | awk 'NR == 1 { print; next } { sub(/^90$/, "5000", $2); printf("%4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6) }'

0

The previous answers mostly work, but here is how I did it, plain and simple. Having looked at the other posts, I believe this will be the most efficient. It also handles the extra request added in the comments: after a line whose second column is 90, the next line's second column is replaced with the value from two lines earlier. It does everything in a single pass.

    BEGIN {
        PC2 = PC6 = 1337
        replacement = 5
    }
    {
        if ($6 == PC6) next
        if (PC2 == 90) $2 = replacement
        replacement = PC2
        PC2 = $2
        PC6 = $6
        printf "%4s%8s%3s%5s%9s%6s\n", $1, $2, $3, $4, $5, $6
    }
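A hedged usage sketch, assuming the program above is saved as replace90.awk and the data (without the header row, as in the example below) as data.txt; both file names are mine, not from the answer:

    gawk -f replace90.awk data.txt

    # Trace around the 90, as I read the code:
    #   line "11 90 ... 68700": PC2 is 21 (from line 10), so replacement becomes 21, then PC2 becomes 90
    #   line "12 31 ... 70221": PC2 == 90, so $2 is set to replacement (21), hence "12 21" in the output

PC2 and PC6 hold the previous line's Type and Offset, while replacement always lags one more line behind, which is how the line after a 90 ends up with the value from two lines earlier.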

Input example

     1  70  0  0  0.0000  57850
     2  31  0  0  0.0000  59371
     3  41  0  0  0.0000  60909
     4  70  0  0  0.0000  61478
     5  31  0  0  0.0000  62999
     6  41  0  0  0.0000  64537
     7  41  0  0  0.0000  64537
     8  70  0  0  0.0000  65106
     9  11  0  0  0.0000  66627
    10  21  0  0  0.0000  68165
    11  90  0  0  0.0000  68700
    12  31  0  0  0.0000  70221

Output example

     1  70  0  0  0.000000  57850
     2  31  0  0  0.000000  59371
     3  41  0  0  0.000000  60909
     4  70  0  0  0.000000  61478
     5  31  0  0  0.000000  62999
     6  41  0  0  0.000000  64537
     8  70  0  0  0.000000  65106
     9  11  0  0  0.000000  66627
    10  21  0  0  0.000000  68165
    11  90  0  0  0.000000  68700
    12  21  0  0  0.000000  70221
0

Source: https://habr.com/ru/post/1401444/

