I have not tested the performance of this, but you could also try something like:
atest = "ACGTACA"
alist = atest.replace('A', '3.').replace('C', '2.').replace('G', '1.').replace('T', '0.').split('.')
anumlist = [int(x) for x in alist if x.isdigit()]
This leads to:
[3, 2, 1, 0, 3, 2, 3]
Edit: Testing it with atest = "ACTACA" * 100000 takes quite some time, so this is probably not a good idea.
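To put an actual number on that, here is a minimal timing sketch, assuming Python 3. The function name replace_chain is hypothetical; it only wraps the snippet above so that timeit can call it. No timings are claimed here, since they depend on the machine and the Python version.

```python
import timeit

def replace_chain(atest):
    # Same approach as above: turn each base into "<digit>." and split on "."
    alist = (atest.replace('A', '3.').replace('C', '2.')
                  .replace('G', '1.').replace('T', '0.').split('.'))
    # The trailing empty string left by split() is dropped by isdigit()
    return [int(x) for x in alist if x.isdigit()]

atest = "ACTACA" * 100000
# Run the conversion once and report the wall-clock time
elapsed = timeit.timeit(lambda: replace_chain(atest), number=1)
print("replace chain: %.3f s for %d characters" % (elapsed, len(atest)))
```

Increasing number in the timeit call averages out noise if the single run is too jittery.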
Edit 5: Another improvement:
import datetime
import numpy as np

class Test(object):
    def __init__(self):
        # Numeric code for each base
        self.mapping = {'A' : 0, 'C' : 1, 'G' : 2, 'T' : 3}

    def char2num(self, astring):
        # One dict lookup per character
        return [self.mapping[c] for c in astring]

def main():
    now = datetime.datetime.now()
    atest = "AGTCAGTCATG"*10000000
    t = Test()
    alist = t.char2num(atest)
    testme = np.array(alist)
    # print() calls so the script also runs on Python 3
    print(testme, len(testme))
    print(datetime.datetime.now() - now)

if __name__ == "__main__":
    main()
It takes about 16 seconds for 110,000,000 characters and keeps your processor busy instead of your RAM:
[0 2 3 ..., 0 3 2] 110000000
0:00:16.866659
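If the per-character dict lookup is still too slow, a different technique (not the one used above) is to let bytes.translate do the mapping in a single pass. This is only a sketch under assumptions: Python 3, pure-ASCII input, and the same A/C/G/T to 0/1/2/3 mapping as the dict above. The name char2num_translate is hypothetical, and I have not timed it against the 16 seconds reported here.

```python
# Translation table: byte value of each base -> its numeric code
# (same mapping as the dict in the Test class above).
TABLE = bytes.maketrans(b"ACGT", bytes([0, 1, 2, 3]))

def char2num_translate(astring):
    """Hypothetical helper: map a DNA string to a list of ints."""
    encoded = astring.encode("ascii")      # fails on non-ASCII input
    translated = encoded.translate(TABLE)  # whole string mapped in one call
    # Iterating over a bytes object yields ints in Python 3
    return list(translated)

print(char2num_translate("AGTCAGTCATG"))
# [0, 2, 3, 1, 0, 2, 3, 1, 0, 3, 2]
```

Note that any character other than A, C, G or T is passed through unchanged as its byte value (for example 'N' becomes 78), whereas the dict version raises a KeyError, so validate the input first if that matters. The translated bytes can also be handed to np.frombuffer(translated, dtype=np.uint8) instead of list() if a NumPy array is wanted.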