I have a file with the following data -
Input -
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
If any line, from line 2 onward, has the same letter as line 1 in the same column, that letter should be changed to 1. Basically, I am trying to find out how similar each line is to the first line.
Desired Result -
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
The first line becomes all 1s, since it is identical to itself (obviously). In the second row, the first and second columns are identical to the first row (A B) and therefore become 1 1. The same applies to the other lines.
I wrote the following code that does this conversion -
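# Use line 1 as the reference (seq is fixed at 1).
for seq in {1..1}; do
    for position in {1..6}; do
        # Letter at this column of the reference line.
        aa=$(awk -v pos="$position" -v line="$seq" 'NR == line {print $pos}' f)
        # Replace that letter with 1 in the same column of every line.
        awk -v var="$aa" -v pos="$position" '{gsub(var, "1", $pos)} 1' f > temp
        mv temp f
    done
done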
As you can imagine, this is very slow: every column runs awk twice and rewrites the whole file. My real data is a 60x10000 matrix, and this program takes about 2 hours to run on it.
Is there a way to do this without the 6 gsubs, for example in a single awk pass?
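For reference, here is a minimal sketch of what a single-pass version might look like. It assumes whitespace-separated fields and reuses the file names f and temp from the loop above. The array name ref is only illustrative. It first recreates the sample input so that it can be run as is.

```shell
# Recreate the sample input as file f (same name as in the loop above).
cat > f <<'EOF'
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
EOF

# Single pass: remember the fields of line 1, then compare every field
# of every line against the field in the same column of line 1.
awk '
NR == 1 { for (i = 1; i <= NF; i++) ref[i] = $i }   # ref[] is a hypothetical name
{
    for (i = 1; i <= NF; i++)
        if ($i == ref[i]) $i = 1                     # same value, same column
    print
}' f > temp

cat temp
```

Note that this compares whole fields with ==, whereas the loop above uses gsub, which treats the letter as a regular expression and also replaces partial matches inside a field. For single-letter fields like the sample the two should agree. Because the file is read only once, this should scale much better than one rewrite of the file per column, though I have not timed it on the real 60x10000 data.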