I have a list of about 1 million unique 16-character strings (an array called VEC), and I want to calculate, in Python, the minimum pairwise Hamming distance for each string (stored in an array called RES). Basically, I calculate the full pairwise distance matrix one row at a time, but save only the minimum value of each row in RES.

    VEC = ['AAAAAAAAAAAAAAAA', 'AAAAAAAAAAAAAAAT', 'AAAAGAAAAAATAAAA', ...]

So, counting from 1, dist(VEC[1], VEC[2]) = 1, dist(VEC[1], VEC[3]) = 2, etc., and RES[1] = 1. Using the tips and tricks from these pages, I came up with:
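
    import Levenshtein
    import numpy

    # Start every minimum at a sentinel larger than any possible distance (max is 16).
    RES = 99 * numpy.ones(len(VEC))
    for i, a in enumerate(VEC):
        # One row of the distance matrix: distances from a to every string in VEC.
        dist = numpy.array([Levenshtein.hamming(a, b) for b in VEC])
        # Strings are unique, so the only zero is the distance from a to itself.
        RES[i] = numpy.amin(dist[dist > 0])
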
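On a sample VEC with only 10 000 strings, this takes about 70 seconds, but the cost is quadratic, so extrapolating to a million strings gives about 8 days. Since every distance is computed twice, I tried calculating only half of the matrix and updating RES as I go: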
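
    import Levenshtein
    import numpy

    RES = 99 * numpy.ones(len(VEC))
    for i in range(len(VEC) - 1):
        # Distances from VEC[i] to the strings after it only (upper triangle).
        dist = [Levenshtein.hamming(VEC[i], VEC[j]) for j in range(i + 1, len(VEC))]
        RES[i] = min(numpy.amin(dist), RES[i])
        # Each distance also counts for the other string of the pair.
        for k, j in enumerate(range(i + 1, len(VEC))):
            if dist[k] < RES[j]:
                RES[j] = dist[k]
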
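However, this second approach is slower (117 seconds), so it does not help. Can anyone suggest improvements/changes to make this faster?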
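
For checking any faster version against the values quoted above, here is a minimal pure-Python reference on the three example strings. It is only a brute-force sketch, not a fast solution; the helper names `hamming` and `min_distances` are made up for illustration and are not part of the Levenshtein library. Note that Python indexes from 0, so VEC[1] in the text is `sample[0]` here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_distances(vec):
    """For each string, the smallest distance to any other string (brute force)."""
    return [
        min(hamming(a, b) for j, b in enumerate(vec) if j != i)
        for i, a in enumerate(vec)
    ]


# Only the three strings shown in the question; the real VEC is much longer.
sample = ["AAAAAAAAAAAAAAAA", "AAAAAAAAAAAAAAAT", "AAAAGAAAAAATAAAA"]

print(hamming(sample[0], sample[1]))  # 1
print(hamming(sample[0], sample[2]))  # 2
print(min_distances(sample))          # [1, 1, 2]
```

Both loops above should produce the same RES as `min_distances` on a small input like this, which makes it a convenient test before timing anything on the full list.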