Match patterns and replace string using if else loop

I have a file containing several lines that start with "1ECLI H --- 12.345 .....". I want to remove the gap between I and H and append R / S / T to the H pattern as it repeats. For example, if H810 is repeated on three successive lines, the letter R must be appended on the first occurrence, S on the second and T on the third, so the first one becomes H810R. Any help would be appreciated. The text looks like this:

 1ECLI H813 98 7.529 8.326 9.267
 1ECLI H813 99 7.427 8.470 9.251
 1ECLI C814 100 7.621 8.513 9.263
 1ECLI H814 101 7.607 8.617 9.289
 1ECLI H814 102 7.633 8.489 9.156
 1ECLI H814 103 7.721 8.509 9.305
 1ECLI C74 104 8.164 8.733 10.740
 1ECLI H74R 105 8.247 8.690 10.799

Expected output:

 1ECLI H813R 98 7.529 8.326 9.267
 1ECLI H813S 99 7.427 8.470 9.251
 1ECLI C814 100 7.621 8.513 9.263
 1ECLI H814R 101 7.607 8.617 9.289
 1ECLI H814S 102 7.633 8.489 9.156
 1ECLI H814T 103 7.721 8.509 9.305
 1ECLI C74 104 8.164 8.733 10.740
 1ECLI H74R 105 8.247 8.690 10.799

Thanks.

3 answers

The awk command below gives the desired result, provided your real input file matches the sample you posted.

 awk 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}1' infile 

Explanation

  • split("R,S,T",a,/,/) - splits the string "R,S,T" on the comma separator and saves the pieces in array a , so that a[1] = R, a[2] = S, a[3] = T

  • f=$2~/^H[0-9]+$/ - f is a variable that stores the result of the regexp test $2 ~ /^H[0-9]+$/ . If the second field consists of H followed only by digits, f is true (1), otherwise false (0)

  • $2 = $2 a[++c] - if the test above was true, the second field is modified: it becomes its existing value plus the element of array a at index c . ++c increments the counter before it is used as the index

  • !f{c=0} - if f is false, the run of consecutive H lines has ended, so the counter c is reset

  • 1 - at the end, the default action is performed, which is to print the current record / line, i.e. print $0 . To see how this works, try awk '1' infile , which prints all records / lines, whereas awk '0' infile prints nothing. Any number other than zero is true, which triggers the default behavior.
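For readers more comfortable with Python, the logic of the one-liner can be sketched roughly as follows. This is an illustrative translation, not part of the original answer; the names `add_suffixes`, `SUFFIXES` and `H_FIELD` are made up for the example, and it assumes whitespace-separated fields as in the sample.

```python
import re

SUFFIXES = ["R", "S", "T"]         # same letters as the awk array a
H_FIELD = re.compile(r"H[0-9]+")   # used with fullmatch, like $2 ~ /^H[0-9]+$/


def add_suffixes(lines):
    """Append R, S, T to the second field on consecutive H lines."""
    out = []
    run = 0  # plays the role of the awk counter c
    for line in lines:
        fields = line.split()
        if len(fields) > 1 and H_FIELD.fullmatch(fields[1]):
            # awk yields an empty string for a[4], a[5], ...; mirror that here
            fields[1] += SUFFIXES[run] if run < len(SUFFIXES) else ""
            run += 1
            # awk rebuilds a modified record with single spaces (default OFS)
            out.append(" ".join(fields))
        else:
            run = 0  # the run of H lines ended, like !f{c=0}
            out.append(line)
    return out
```

Note that, as in the awk version, the counter follows consecutive H lines in general, not one particular H number: an H813 line directly followed by an H814 line would continue with S instead of starting again at R.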

Test results:

 $ cat infile
 1ECLI H813 98 7.529 8.326 9.267
 1ECLI H813 99 7.427 8.470 9.251
 1ECLI C814 100 7.621 8.513 9.263
 1ECLI H814 101 7.607 8.617 9.289
 1ECLI H814 102 7.633 8.489 9.156
 1ECLI H814 103 7.721 8.509 9.305
 1ECLI C74 104 8.164 8.733 10.740
 1ECLI H74R 105 8.247 8.690 10.799

 $ awk 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}1' infile
 1ECLI H813R 98 7.529 8.326 9.267
 1ECLI H813S 99 7.427 8.470 9.251
 1ECLI C814 100 7.621 8.513 9.263
 1ECLI H814R 101 7.607 8.617 9.289
 1ECLI H814S 102 7.633 8.489 9.156
 1ECLI H814T 103 7.721 8.509 9.305
 1ECLI C74 104 8.164 8.733 10.740
 1ECLI H74R 105 8.247 8.690 10.799

If you want different formatting, for example a tab or some other character as the output field separator, use the version below and change the OFS variable:

 $ awk -v OFS="\t" 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}{$1=$1}1' infile
 1ECLI	H813R	98	7.529	8.326	9.267
 1ECLI	H813S	99	7.427	8.470	9.251
 1ECLI	C814	100	7.621	8.513	9.263
 1ECLI	H814R	101	7.607	8.617	9.289
 1ECLI	H814S	102	7.633	8.489	9.156
 1ECLI	H814T	103	7.721	8.509	9.305
 1ECLI	C74	104	8.164	8.733	10.740
 1ECLI	H74R	105	8.247	8.690	10.799
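The extra {$1=$1} block is what makes OFS apply to every line: assigning to any field forces awk to rebuild the record from its fields joined by OFS, so lines whose second field was not modified get re-separated too. A rough Python equivalent of that rebuild step, with `rebuild` and `ofs` as names invented for this illustration:

```python
def rebuild(line, ofs="\t"):
    """Split on runs of whitespace and rejoin with ofs, like awk's $1=$1."""
    return ofs.join(line.split())
```

As with awk's default field splitting, leading and trailing whitespace is dropped and any run of blanks between fields collapses into a single separator.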

If your Input_file is the same as the sample shown, please try the following awk and let me know if it helps.

 awk '
 BEGIN{ val[1]="R"; val[2]="S"; val[3]="T" }
 $2 !~ /^H[0-9]+/ || i==3{ i="" }
 $2 ~ /^H[0-9]+$/ && /^1ECLI/{ $2=$2val[++i] }
 1
 ' Input_file > temp_file && mv temp_file Input_file

Here is the same code with explanatory comments:

 awk '
 BEGIN{                                          ##Starting the BEGIN section of awk here.
   val[1]="R";                                   ##Creating an array named val whose index 1 holds the string R.
   val[2]="S";                                   ##Creating the 2nd element of array val, whose value is S.
   val[3]="T"                                    ##Creating the 3rd element of array val, whose value is T.
 }
 $2 !~ /^H[0-9]+/ || i==3{                       ##Checking if the 2nd field does not start with H followed by digits OR the value of variable i equals 3.
   i=""                                          ##Then nullifying the value of variable i here.
 }
 $2 ~ /^H[0-9]+$/ && /^1ECLI/{                   ##Checking if the 2nd field is H followed only by digits up to its end AND the line starts with the string 1ECLI.
   $2=$2val[++i]                                 ##Re-creating the 2nd field by appending the element of array val at the incremented index i.
 }
 1                                               ##Mentioning 1 here, which means the current line is printed.
 ' Input_file > temp_file && mv temp_file Input_file   ##Mentioning the Input_file name here.
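One behavioral detail of this version is that the counter is also reset once it reaches 3, so a fourth consecutive H line starts again at R. A rough Python sketch of the same control flow, under the assumption of whitespace-separated fields (the names `add_cycling_suffixes` and `VAL` are invented for this illustration):

```python
import re

VAL = {1: "R", 2: "S", 3: "T"}  # same contents as the awk array val


def add_cycling_suffixes(lines):
    """Mirror the two awk rules: reset the counter, then append a suffix."""
    out = []
    i = 0
    for line in lines:
        fields = line.split()
        second = fields[1] if len(fields) > 1 else ""
        # Rule 1: field does not start with H+digits, or counter reached 3
        if not re.match(r"H[0-9]+", second) or i == 3:
            i = 0
        # Rule 2: field is exactly H+digits and the line starts with 1ECLI
        if re.fullmatch(r"H[0-9]+", second) and line.startswith("1ECLI"):
            i += 1
            fields[1] = second + VAL[i]
            out.append(" ".join(fields))  # awk rebuilds with single spaces
        else:
            out.append(line)
    return out
```

Because the reset test is only anchored at the start of the field, a value such as H74R (H, digits, then a letter) neither resets the counter nor receives a suffix; the sketch keeps that behavior rather than correcting it.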

The code below assumes that lines is a list of strings, each representing one line of your file.


 from collections import defaultdict

 with open('filename') as f:
     lines = f.readlines()

 cntd = defaultdict(int)      # occurrences seen so far for each keyword
 suffix = ['R', 'S', 'T']
 newlines = []
 for line in lines:
     try:
         kwd = line.split()[1]
     except IndexError:       # blank line or line with a single field
         newlines.append(line)
         continue
     if kwd[0] == 'H' and kwd[-1].isdigit():
         sfx = suffix[cntd[kwd]]   # raises IndexError on a fourth occurrence
         idx = line.index(kwd)
         # drops the character just before the keyword, then appends the suffix
         nl = line[:idx - 1] + kwd + sfx + line[idx + len(kwd):]
         # nl = line[:idx + len(kwd)] + sfx + line[idx + len(kwd):]
         # adjust formatting to your taste
         newlines.append(nl)
         cntd[kwd] += 1
     else:
         newlines.append(line)

 with open('filename', 'w') as f:
     f.writelines(newlines)
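The counter above runs over the whole file, so a keyword that reappears later in a separate block would continue from its previous count. If the suffix should restart for each run of successive lines, one possible variant groups adjacent lines by keyword with itertools.groupby. This is only a sketch under that assumption; `keyword` and `suffix_runs` are names made up for the example, and lines beyond the third in a run are left unchanged.

```python
from itertools import groupby

SUFFIXES = "RST"


def keyword(line):
    """Return the second whitespace-separated field, or None if absent."""
    parts = line.split()
    return parts[1] if len(parts) > 1 else None


def suffix_runs(lines):
    """Append R, S, T within each run of adjacent lines sharing a keyword."""
    out = []
    for kwd, run in groupby(lines, key=keyword):
        is_h = bool(kwd) and kwd[0] == "H" and kwd[-1].isdigit()
        for n, line in enumerate(run):
            if is_h and n < len(SUFFIXES):
                # replace only the first occurrence; assumes the keyword
                # text does not also appear inside the first field
                out.append(line.replace(kwd, kwd + SUFFIXES[n], 1))
            else:
                out.append(line)
    return out
```

The original spacing and line endings are kept, since only the keyword itself is replaced.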

Source: https://habr.com/ru/post/1273136/

