Python regular expression: extract text between patterns

How can I get all the values between "uniprotkb:" and "(gene name)" in the string "str" below?

str = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)' 

Result:

 HIST1H3D HIST1H3A HIST1H3B HIST1H3C HIST1H3E HIST1H3F HIST1H3G HIST1H3H HIST1H3I HIST1H3J 
3 answers

Using re.findall(), you can get all parts of the string that match the regular expression. Because the pattern contains one capture group, only the captured text is returned:

    >>> import re
    >>> sstr = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
    >>> re.findall(r'uniprotkb:([^(]*)\(gene name\)', sstr)
    ['HIST1H3D', 'HIST1H3A', 'HIST1H3B', 'HIST1H3C', 'HIST1H3E', 'HIST1H3F', 'HIST1H3G', 'HIST1H3H', 'HIST1H3I', 'HIST1H3J']
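If the same extraction runs on many strings, the pattern can be compiled once and wrapped in a function. The following is only a sketch: the names GENE_NAME_RE and gene_names are made up for illustration, and the lazy variant `(.*?)` is an alternative spelling of the same idea, not something the answer above used.

```python
import re

# Compiled once; the single capture group means findall() returns
# only the text between the two markers.
GENE_NAME_RE = re.compile(r'uniprotkb:([^(]*)\(gene name\)')


def gene_names(text):
    """Return every value found between 'uniprotkb:' and '(gene name)'."""
    return GENE_NAME_RE.findall(text)


sample = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)'
print(gene_names(sample))  # ['HIST1H3D', 'HIST1H3A']

# Alternative: a lazy quantifier stops at the first '(gene name)' it meets,
# which gives the same result on this input.
lazy = re.findall(r'uniprotkb:(.*?)\(gene name\)', sample)
print(lazy)  # ['HIST1H3D', 'HIST1H3A']
```

The `[^(]*` form assumes the value itself never contains an opening parenthesis; the lazy form does not need that assumption.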

Here is a one-liner:

    astr = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
    [pt.split('(')[0] for pt in astr.strip().split('uniprotkb:')][1:]

gives:

 ['HIST1H3D', 'HIST1H3A', 'HIST1H3B', 'HIST1H3C', 'HIST1H3E', 'HIST1H3F', 'HIST1H3G', 'HIST1H3H', 'HIST1H3I', 'HIST1H3J'] 

I would not recommend regex solutions if runtime matters.
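Whether the split-based version is actually faster depends on the input and the Python version, so it is worth measuring rather than assuming. A rough sketch with timeit follows; the function names with_regex and with_split are made up here, and no timings are quoted because they vary from machine to machine.

```python
import re
import timeit

# Rebuild the ten-entry string from the question.
s = '|'.join('uniprotkb:HIST1H3%s(gene name)' % c for c in 'DABCEFGHIJ')

pattern = re.compile(r'uniprotkb:([^(]*)\(gene name\)')


def with_regex():
    return pattern.findall(s)


def with_split():
    return [pt.split('(')[0] for pt in s.strip().split('uniprotkb:')][1:]


# Both approaches must agree before their speed is worth comparing.
assert with_regex() == with_split()

for fn in (with_regex, with_split):
    seconds = timeit.timeit(fn, number=10_000)
    print(f'{fn.__name__}: {seconds:.4f} s for 10,000 calls')
```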


I wouldn't bother with regex:

    s = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)'  # etc
    gene_names = []
    for substring in s.split('|'):
        removed_first = substring.partition('uniprotkb:')[2]  # remove the first part of the substring
        removed_second = removed_first.partition('(gene name)')[0]  # remove the second part
        gene_names.append(removed_second)  # put it on the list

That should do the trick. You could even make it a one-liner; this is equivalent to:

 gene_names = [substring.partition('uniprotkb:')[2].partition('(gene name)')[0] for substring in s.split('|')] 
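One thing to watch with the partition approach: when a field does not contain 'uniprotkb:', partition() returns empty strings for the separator and the tail, so the list ends up with an empty entry. A sketch that skips such fields is below. The function name extract_gene_names and the 'refseq:NP_003521' field are invented for illustration and do not come from the question's data.

```python
def extract_gene_names(s):
    """Collect the text between 'uniprotkb:' and '(gene name)' per '|' field."""
    names = []
    for field in s.split('|'):
        # partition() returns (before, separator, after); the separator is ''
        # when it was not found in the string.
        _, prefix, rest = field.partition('uniprotkb:')
        name, suffix, _ = rest.partition('(gene name)')
        if prefix and suffix:  # keep only fields that contain both markers
            names.append(name)
    return names


# Hypothetical input with one field that uses a different prefix.
mixed = 'uniprotkb:HIST1H3D(gene name)|refseq:NP_003521(gene name)|uniprotkb:HIST1H3A(gene name)'
print(extract_gene_names(mixed))  # ['HIST1H3D', 'HIST1H3A']
```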

Source: https://habr.com/ru/post/1437406/

