Looking for the best awk or perl solutions: avoid hard pipes, etc.

Question


I had to parse files that list the eigenvectors of a square matrix in a seven-column format and convert them into a square matrix in which each eigenvector is a column.

    Eigenvector file: COVAR
    72 72
    42.27674 53.43516 43.10335 43.43889 53.15094 43.77146 43.17536
    52.49170 45.07565 42.10424 52.75460 45.74721 41.66882 52.21836
    47.00361 40.21403 51.86627 47.05245 39.75512 50.92583 47.83411
    38.36019 50.61541 48.00747 37.56547 51.66199 48.72199 36.29018
    51.70312 48.54869 35.35773 52.59045 49.19493 34.14085 51.90543
    49.78376 33.43961 52.55997 50.66576 32.13812 52.14743 51.17284
    31.02647 52.41422 50.19470 30.02426 51.60068 50.14591 28.86206
    51.70417 49.28895 27.52769 51.49614 49.94867 27.52460 50.99136
    51.12215 26.37751 50.74786 51.93507 25.23025 50.04549 51.26765
    25.46212 49.27591 50.30035 24.47349 48.61017 49.51955 23.64720
    49.41136 48.60875
    ****
    1 3.28044
    0.06504 -0.20409 -0.08035 0.04603 -0.02034 -0.02343 0.03885
    0.14025 0.01970 -0.00569 0.11391 -0.05271 -0.00874 0.25005
    -0.02425 0.03969 0.13327 0.01054 0.09958 0.20857 0.08647
    0.13883 0.12003 0.12859 0.05634 0.06415 0.02570 0.07466
    -0.06541 0.04636 0.01246 -0.13691 -0.04270 0.03791 -0.15341
    -0.02595 -0.01027 -0.15604 -0.08393 -0.00526 -0.16938 -0.09027
    0.01573 -0.25999 -0.09350 0.01121 -0.24367 -0.01033 0.03059
    -0.31268 -0.00040 0.02074 -0.17927 -0.01689 -0.02183 -0.03912
    -0.01481 -0.03982 0.10507 -0.03446 -0.06896 0.20946 -0.00450
    -0.17669 0.17617 0.08755 -0.21143 0.25313 0.12818 -0.13896
    0.16625 0.06539
    ****
    2 1.17147
    0.05028 0.24209 0.07571 0.07015 0.26226 0.10552 0.09788
    0.15535 0.10020 0.06248 0.07167 0.09337 0.06555 -0.05258
    0.07777 0.05163 -0.08617 -0.01580 0.05087 -0.17374 -0.06483
    0.03157 -0.18854 -0.12423 0.02388 -0.15753 -0.07304 0.00221
    -0.12406 -0.11678 -0.00030 -0.07568 -0.07783 -0.00225 -0.10201
    -0.09521 0.00373 -0.10066 -0.06755 -0.00386 -0.10808 -0.08343
    -0.01420 -0.03899 -0.11123 -0.06186 -0.02282 -0.11633 -0.07596
    0.03656 -0.14599 -0.07542 0.13621 -0.11299 -0.07350 0.22728
    -0.02254 -0.07473 0.32577 0.01167 -0.09106 0.17148 0.10912
    -0.01607 0.00303 0.19984 -0.01223 -0.16824 0.28827 -0.00879
    -0.23259 0.16630
    ****
    3 et cetera ....

I managed to solve my problem as best I could, with a lot of pipes. This is an excerpt from my script, which also extracts the eigenvalues (the number next to the natural number on the line after each ****):

    # the second line of the file holds the dimension of the rotation matrix
    local dimensions=$(awk 'NR==2 {print$1}' ${ptraj_eigvect[$k]})

    # Ptraj produces a file in seven-column format, hence the 7 below
    if [[ $((${dimensions} % 7 )) == 0 ]]
    then
      local -i n_rows_eigvect_ptraj=$(( ${dimensions} / 7 ))
    else
      local -i n_rows_eigvect_ptraj=$(( (${dimensions} / 7) + 1 ))
    fi

    # NR offset below: 2 header lines + matrix rows + the **** line
    awk 'NR>'$(( 2 + ${n_rows_eigvect_ptraj} + 1 ))' && NR%'$(( 2 + ${n_rows_eigvect_ptraj} ))'==2' ${ptraj_eigvect[$k]} >${eigval_file}

    awk 'NR>'$(( 2 + ${n_rows_eigvect_ptraj} + 2 ))' && NR%'$(( 2 + ${n_rows_eigvect_ptraj} ))'!=2 && NR%'$(( 2 + ${n_rows_eigvect_ptraj} ))'!=1' ${ptraj_eigvect[$k]} |
      xargs printf "%s\n" |
      awk '($0=$NF x)&&ORS=NR%'${dimensions}'?FS:RS' |
      awk -f ${script_PA}/transpose.awk >${rotmatr_file}

    if [[ $(wc -l <${rotmatr_file}) != ${dimensions} ]] || [[ $(wc -w <${rotmatr_file}) != $(( ${dimensions} * ${dimensions} )) ]]
    then
      echo 'ERROR!!!'
      exit 1
    fi
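As a side note, the if/else that computes the number of seven-column rows is a ceiling division, and it can be written as a single arithmetic expression. The following is only a sketch; the helper name `rows_for` and its optional second argument are made up for illustration and are not part of the original script.

```shell
# Hypothetical helper: number of rows needed to hold $1 values
# when each row carries $2 values (7 if $2 is omitted).
rows_for() {
    # Adding (columns - 1) before the integer division rounds up.
    echo $(( ($1 + ${2:-7} - 1) / ${2:-7} ))
}

rows_for 72   # prints 11 (ten full rows plus one row with 2 values)
rows_for 70   # prints 10 (exactly ten full rows)
```

With this, the script could set `n_rows_eigvect_ptraj=$(rows_for "$dimensions")` instead of branching on the remainder.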

transpose.awk file here

Edit, as requested:

My script produces a 72 x 72 square matrix; only the first two columns are shown here. You can see that the numbers correspond to the numbers following 1 3.28044 and 2 1.17147 in the input.

    0.06504 0.05028
    -0.20409 0.24209
    -0.08035 0.07571
    0.04603 0.07015
    -0.02034 0.26226
    -0.02343 0.10552
    0.03885 0.09788
    0.14025 0.15535
    0.01970 0.10020
    -0.00569 0.06248
    0.11391 0.07167
    -0.05271 0.09337
    -0.00874 0.06555
    0.25005 -0.05258
    -0.02425 0.07777
    0.03969 0.05163
    0.13327 -0.08617
    0.01054 -0.01580
    0.09958 0.05087
    0.20857 -0.17374
    0.08647 -0.06483
    0.13883 0.03157
    0.12003 -0.18854
    0.12859 -0.12423
    0.05634 0.02388
    0.06415 -0.15753
    0.02570 -0.07304
    0.07466 0.00221
    -0.06541 -0.12406
    0.04636 -0.11678
    0.01246 -0.00030
    -0.13691 -0.07568
    -0.04270 -0.07783
    0.03791 -0.00225
    -0.15341 -0.10201
    -0.02595 -0.09521
    -0.01027 0.00373
    -0.15604 -0.10066
    -0.08393 -0.06755
    -0.00526 -0.00386
    -0.16938 -0.10808
    -0.09027 -0.08343
    0.01573 -0.01420
    -0.25999 -0.03899
    -0.09350 -0.11123
    0.01121 -0.06186
    -0.24367 -0.02282
    -0.01033 -0.11633
    0.03059 -0.07596
    -0.31268 0.03656
    -0.00040 -0.14599
    0.02074 -0.07542
    -0.17927 0.13621
    -0.01689 -0.11299
    -0.02183 -0.07350
    -0.03912 0.22728
    -0.01481 -0.02254
    -0.03982 -0.07473
    0.10507 0.32577
    -0.03446 0.01167
    -0.06896 -0.09106
    0.20946 0.17148
    -0.00450 0.10912
    -0.17669 -0.01607
    0.17617 0.00303
    0.08755 0.19984
    -0.21143 -0.01223
    0.25313 -0.16824
    0.12818 0.28827
    -0.13896 -0.00879
    0.16625 -0.23259
    0.06539 0.16630
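The sanity check at the end of the script (line count equal to the dimension, word count equal to its square) can also be done in a single awk call, which additionally catches ragged rows. This is only a sketch; `is_square` is a hypothetical helper name, not something from the original script.

```shell
# Hypothetical helper: succeed only if file $1 has exactly $2 lines
# and every line has exactly $2 whitespace-separated fields.
is_square() {
    awk -v n="$2" '
        NF != n { bad = 1 }              # a row with the wrong number of columns
        END     { exit (bad || NR != n) } # exit status 0 only for an n x n file
    ' "$1"
}
```

It could be used as `is_square "$rotmatr_file" "$dimensions" || { echo 'ERROR!!!'; exit 1; }`.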

Since I am trying to learn awk, and possibly perl in the future, could you please show me how to write an awk or perl script that performs the same task?

Thank you very much for your attention.


bash awk perl pipe xargs

Mareczek Nov 13 '11 at 1:51


5 answers

TLP · Answer 1 · 2011-11-13T03:49:12+0000

I worked on this for a while and did not come up with anything very pretty, but the code seems to work, even if it is rather clumsy. It assumes your data is completely consistent, and it does not care about the headers.

On the plus side, if you change <DATA> to <>, it will work on your data file with:

 > script.pl input > output

This assumes that your data file has the same formatting as your example, and that your vectors appear in numerical order.

Code:

    use strict;
    use warnings;
    use v5.10;

    my @data;
    my $tmp;
    while (<DATA>) {
        if (/^\s*\*+/) {                 # or some other way of separating vectors
            push @data, $tmp if $tmp;    # push buffer to array
            <DATA>;                      # discard header
            $tmp = "";                   # reset buffer
        } else {
            $tmp .= $_;                  # buffer a new line
        }
    }
    push @data, $tmp;                    # push remaining buffer onto array
    @data = map { [ split ] } @data;     # split string into array
    for my $num (0 .. $#{$data[0]}) {
        # 0 .. $#data instead of keys @data, which needs perl 5.12 or later
        say join " ", map $data[$_][$num], 0 .. $#data;
    }

    __DATA__
    ****
    1 3.28044
    0.06504 -0.20409 -0.08035 0.04603 -0.02034 -0.02343 0.03885
    0.14025 0.01970 -0.00569 0.11391 -0.05271 -0.00874 0.25005
    -0.02425 0.03969 0.13327 0.01054 0.09958 0.20857 0.08647
    0.13883 0.12003 0.12859 0.05634 0.06415 0.02570 0.07466
    -0.06541 0.04636 0.01246 -0.13691 -0.04270 0.03791 -0.15341
    -0.02595 -0.01027 -0.15604 -0.08393 -0.00526 -0.16938 -0.09027
    0.01573 -0.25999 -0.09350 0.01121 -0.24367 -0.01033 0.03059
    -0.31268 -0.00040 0.02074 -0.17927 -0.01689 -0.02183 -0.03912
    -0.01481 -0.03982 0.10507 -0.03446 -0.06896 0.20946 -0.00450
    -0.17669 0.17617 0.08755 -0.21143 0.25313 0.12818 -0.13896
    0.16625 0.06539
    ****
    2 1.17147
    0.05028 0.24209 0.07571 0.07015 0.26226 0.10552 0.09788
    0.15535 0.10020 0.06248 0.07167 0.09337 0.06555 -0.05258
    0.07777 0.05163 -0.08617 -0.01580 0.05087 -0.17374 -0.06483
    0.03157 -0.18854 -0.12423 0.02388 -0.15753 -0.07304 0.00221
    -0.12406 -0.11678 -0.00030 -0.07568 -0.07783 -0.00225 -0.10201
    -0.09521 0.00373 -0.10066 -0.06755 -0.00386 -0.10808 -0.08343
    -0.01420 -0.03899 -0.11123 -0.06186 -0.02282 -0.11633 -0.07596
    0.03656 -0.14599 -0.07542 0.13621 -0.11299 -0.07350 0.22728
    -0.02254 -0.07473 0.32577 0.01167 -0.09106 0.17148 0.10912
    -0.01607 0.00303 0.19984 -0.01223 -0.16824 0.28827 -0.00879
    -0.23259 0.16630

Chris · Answer 2 · 2011-11-13T06:48:16+0000

For an awk solution, try the following. Save these commands in a file named s.awk:

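    # A separator line starts a new eigenvector: bump the counter, reset the offset
    /\*\*\*/ { i++; accInd = 0; next }

    # Every later line: store its fields at consecutive positions of vector i.
    # Positions 1 and 2 hold the index and the eigenvalue from the header line.
    (i > 0) {
        for (k = 1; k <= NF; k++) {
            I = k + accInd
            a[i,I] = $k
        }
        accInd = accInd + (k - 1)   # k is NF+1 here, so this adds NF
    }

    # Print one row per component and one column per eigenvector,
    # starting at 3 to skip the index and the eigenvalue
    END {
        for (n = 3; n <= I; n++) {
            for (m = 1; m <= i; m++) {
                printf "%f\t", a[m,n]
            }
            printf "\n"
        }
    }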

Then run this command from the command line:

 $ awk -f s.awk file

HTH Chris

wm.wragg · Answer 3 · 2011-11-13T07:00:16+0000

If I understand the problem correctly, I think this AWK script will do the job. I tried to make it easy to read and understand, so it is rather verbose:

    ####
    # Use like:
    #
    #   awk -f transpose.awk <Eigenvector file>
    #
    # This script assumes that all Eigenvectors in the file have the same number
    # of values. The script will output all Eigenvectors into columns, e.g. if
    # there are three Eigenvectors it will produce three columns of values.
    #
    ####

    BEGIN {
        # Keeps track of the number of Eigenvectors
        currentEV = 0;
    }

    # Signifies a new Eigenvector (EV)
    $1 == "****" {
        newEV = "true";
        transpose = "true";
        next;
    }

    # Get the EV number
    newEV == "true" {
        newEV = "false";
        currentEV = $1;
        currentEVCol = 0;
        next;
    }

    # Add all the values on the line, for the current EV, into the EV array
    transpose == "true" {
        for (i=1; i<=NF; i++) {
            ev[currentEV,++currentEVCol] = $i;
        }
    }

    END {
        # Loop through the array and print the EVs out in columns
        for (i=1; i<=currentEVCol; i++) {
            for (j=1; j<=currentEV; j++) {
                # explicit format string, so the data is never used as a format
                printf "%s ", ev[j,i];
            }
            print "";
        }
    }

For a short version, copy the following into a file called transpose.awk:

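    # Skip the "index eigenvalue" line that follows each separator
    skip { skip = 0; next; }

    # Separator: start a new eigenvector and reset its component counter
    $1 == "****" { EV++; EVC = 0; skip = 1; next; }

    # Non-empty line inside an eigenvector block: store its values
    NF && EV {
        for (i=1; i<=NF; i++) {
            EVA[EV,++EVC] = $i;
        }
    }

    END {
        for (i=1; i<=EVC; i++) {
            for (j=1; j<=EV; j++) {
                printf "%s ", EVA[j,i];
            }
            print "";
        }
    }

And call it with:

    $ awk -f transpose.awk file > transposedFile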
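Since the script in the question also extracts the eigenvalues, a single awk pass can write both outputs and replace the whole pipeline. The following is only a sketch, assuming the file layout shown in the question (a `****` line, then an `index eigenvalue` line, then the components, with everything before the first `****` ignored). The function name `split_eigen` and its argument order are made up for illustration.

```shell
# Hypothetical wrapper:
#   $1 = eigenvector file, $2 = eigenvalue output file, $3 = matrix output file
split_eigen() {
    awk -v eigval_file="$2" '
        $1 == "****" { header = 1; next }   # separator: the next line is the header
        header {                            # header line: "index eigenvalue"
            header = 0
            nvec++                          # start a new eigenvector
            ncomp = 0                       # reset its component counter
            print $2 > eigval_file          # eigenvalues go to their own file
            next
        }
        nvec {                              # component lines of the current vector
            for (i = 1; i <= NF; i++) comp[nvec, ++ncomp] = $i
        }
        END {                               # one row per component, one column per vector
            for (r = 1; r <= ncomp; r++) {
                line = comp[1, r]
                for (c = 2; c <= nvec; c++) line = line " " comp[c, r]
                print line
            }
        }
    ' "$1" > "$3"
}
```

A call would look like `split_eigen eigvect.dat eigenvalues.txt rotmatr.txt`, where the three file names are placeholders. Like the scripts above, it assumes every eigenvector has the same number of components.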

Homer6 · Answer 4 · 2011-11-13T01:56:17+0000

If you are willing to code it in C++, you could use Boost::regex or flex/bison.

Joel berger · Answer 5 · 2011-11-13T20:28:36+0000

Rather similar to TLP's, but a little cleaner in my opinion. It also stores the eigenvalues in a separate array. As with his, you can change <DATA> to <> and run it as scriptname.pl mydata.dat (after that you can remove the __DATA__ tag and everything after it).

It uses the Array::Transpose module to do the transposing (install it with cpan). For visualization, it uses the Data::Dumper module and its Dumper function. The grep { length } bits remove the empty elements produced by split. This could probably be avoided by stripping leading whitespace first, but this approach seemed more robust.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Data::Dumper;
    use Array::Transpose;

    my $row = -1;
    my @eigen;
    my @data;

    while( <DATA> ) {
      if (/\*+/) {
        #increment row number
        $row++;

        #next line is eigenvalue, keep it in @eigen
        my @line = grep { length } split( /\s+/, <DATA>);
        push @eigen, $line[-1];

        # move on to next line
        next;
      }

      next if $row < 0; #skip first block

      push @{ $data[$row] }, grep { length } split( /\s+/ );
    }

    my @transpose = transpose(\@data);

    print Dumper \@eigen;
    print Dumper \@transpose;

    __DATA__
    Eigenvector file: COVAR
    72 72
    42.27674 53.43516 43.10335 43.43889 53.15094 43.77146 43.17536
    52.49170 45.07565 42.10424 52.75460 45.74721 41.66882 52.21836
    47.00361 40.21403 51.86627 47.05245 39.75512 50.92583 47.83411
    38.36019 50.61541 48.00747 37.56547 51.66199 48.72199 36.29018
    51.70312 48.54869 35.35773 52.59045 49.19493 34.14085 51.90543
    49.78376 33.43961 52.55997 50.66576 32.13812 52.14743 51.17284
    31.02647 52.41422 50.19470 30.02426 51.60068 50.14591 28.86206
    51.70417 49.28895 27.52769 51.49614 49.94867 27.52460 50.99136
    51.12215 26.37751 50.74786 51.93507 25.23025 50.04549 51.26765
    25.46212 49.27591 50.30035 24.47349 48.61017 49.51955 23.64720
    49.41136 48.60875
    ****
    1 3.28044
    0.06504 -0.20409 -0.08035 0.04603 -0.02034 -0.02343 0.03885
    0.14025 0.01970 -0.00569 0.11391 -0.05271 -0.00874 0.25005
    -0.02425 0.03969 0.13327 0.01054 0.09958 0.20857 0.08647
    0.13883 0.12003 0.12859 0.05634 0.06415 0.02570 0.07466
    -0.06541 0.04636 0.01246 -0.13691 -0.04270 0.03791 -0.15341
    -0.02595 -0.01027 -0.15604 -0.08393 -0.00526 -0.16938 -0.09027
    0.01573 -0.25999 -0.09350 0.01121 -0.24367 -0.01033 0.03059
    -0.31268 -0.00040 0.02074 -0.17927 -0.01689 -0.02183 -0.03912
    -0.01481 -0.03982 0.10507 -0.03446 -0.06896 0.20946 -0.00450
    -0.17669 0.17617 0.08755 -0.21143 0.25313 0.12818 -0.13896
    0.16625 0.06539
    ****
    2 1.17147
    0.05028 0.24209 0.07571 0.07015 0.26226 0.10552 0.09788
    0.15535 0.10020 0.06248 0.07167 0.09337 0.06555 -0.05258
    0.07777 0.05163 -0.08617 -0.01580 0.05087 -0.17374 -0.06483
    0.03157 -0.18854 -0.12423 0.02388 -0.15753 -0.07304 0.00221
    -0.12406 -0.11678 -0.00030 -0.07568 -0.07783 -0.00225 -0.10201
    -0.09521 0.00373 -0.10066 -0.06755 -0.00386 -0.10808 -0.08343
    -0.01420 -0.03899 -0.11123 -0.06186 -0.02282 -0.11633 -0.07596
    0.03656 -0.14599 -0.07542 0.13621 -0.11299 -0.07350 0.22728
    -0.02254 -0.07473 0.32577 0.01167 -0.09106 0.17148 0.10912
    -0.01607 0.00303 0.19984 -0.01223 -0.16824 0.28827 -0.00879
    -0.23259 0.16630
