Here's the vector approach -
```python
# Get the row, col indices that are to be set in the output array
r, c = np.tril_indices(ndocs, -1)

# Use those indices to slice out the respective columns
p1 = rawdata[:, c]
p2 = rawdata[:, r]

# Perform the n11 and n00 computations across all indexed columns at once
n11v = ((p1 == 1) & (p2 == 1)).sum(0)
n00v = ((p1 == 0) & (p2 == 0)).sum(0)

# Finally, set up the output array and perform the final division
out = np.eye(ndocs)
out[c, r] = n11v / (nfeats - n00v)
```
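To see what `np.tril_indices` contributes here, a tiny illustration (the `ndocs = 4` value is hypothetical, not from the original benchmark) shows it enumerating every unordered pair of distinct documents exactly once:

```python
import numpy as np

# Small illustration: 4 documents.
ndocs = 4

# Row/col indices of every entry strictly below the main diagonal --
# i.e. each unordered pair of distinct documents, exactly once.
r, c = np.tril_indices(ndocs, -1)

pairs = [(int(i), int(j)) for i, j in zip(r, c)]
print(pairs)  # -> [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
```

Slicing `rawdata[:, c]` and `rawdata[:, r]` then lines up both columns of every pair, so all pairwise counts fall out of one vectorized pass.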
An alternative way to calculate `n11v` and `n00v` with `np.einsum` is:
```python
n11v = np.einsum('ij,ij->j', (p1 == 1), (p2 == 1).astype(int))
n00v = np.einsum('ij,ij->j', (p1 == 0), (p2 == 0).astype(int))
```
If `rawdata` consists of only 0s and 1s, an easier way to get them is:
```python
n11v = np.einsum('ij,ij->j', p1, p2)
n00v = np.einsum('ij,ij->j', 1 - p1, 1 - p2)
```
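All three formulations count the same thing, which can be checked on toy data (the random 0/1 matrix below is hypothetical, used only to verify the equivalence):

```python
import numpy as np

# Toy 0/1 matrix: 6 features x 5 documents.
rng = np.random.default_rng(0)
rawdata = rng.integers(0, 2, size=(6, 5))

r, c = np.tril_indices(rawdata.shape[1], -1)
p1, p2 = rawdata[:, c], rawdata[:, r]

# Boolean-mask counts
n11v = ((p1 == 1) & (p2 == 1)).sum(0)
n00v = ((p1 == 0) & (p2 == 0)).sum(0)

# General einsum form
n11e = np.einsum('ij,ij->j', (p1 == 1), (p2 == 1).astype(int))
n00e = np.einsum('ij,ij->j', (p1 == 0), (p2 == 0).astype(int))

# Shortcut einsum form for 0/1 data
n11s = np.einsum('ij,ij->j', p1, p2)
n00s = np.einsum('ij,ij->j', 1 - p1, 1 - p2)

assert (n11v == n11e).all() and (n11v == n11s).all()
assert (n00v == n00e).all() and (n00v == n00s).all()
```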
Benchmarking
Function Definitions -
```python
def original_app(rawdata, ndocs, nfeats):
    tru_sim = np.zeros((ndocs, ndocs))
    for i in range(ndocs):
        tru_sim[i, i] = 1
        for j in range(i + 1, ndocs):
            tru_sim[i, j] = jaccard(rawdata[:, i], rawdata[:, j])
    return tru_sim

def vectorized_app(rawdata, ndocs, nfeats):
    r, c = np.tril_indices(ndocs, -1)
    p1 = rawdata[:, c]
    p2 = rawdata[:, r]
    n11v = ((p1 == 1) & (p2 == 1)).sum(0)
    n00v = ((p1 == 0) & (p2 == 0)).sum(0)
    out = np.eye(ndocs)
    out[c, r] = n11v / (nfeats - n00v)
    return out
```
Check and Timing -
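A self-contained check can be run along these lines (the `jaccard` helper and the random test data are assumptions, chosen to match the `n11 / (nfeats - n00)` formula used above):

```python
import numpy as np

def jaccard(a, b):
    # Hypothetical stand-in for the jaccard used in the question,
    # consistent with n11 / (nfeats - n00).
    n11 = ((a == 1) & (b == 1)).sum()
    n00 = ((a == 0) & (b == 0)).sum()
    return n11 / (a.size - n00)

def original_app(rawdata, ndocs, nfeats):
    tru_sim = np.zeros((ndocs, ndocs))
    for i in range(ndocs):
        tru_sim[i, i] = 1
        for j in range(i + 1, ndocs):
            tru_sim[i, j] = jaccard(rawdata[:, i], rawdata[:, j])
    return tru_sim

def vectorized_app(rawdata, ndocs, nfeats):
    r, c = np.tril_indices(ndocs, -1)
    p1, p2 = rawdata[:, c], rawdata[:, r]
    n11v = ((p1 == 1) & (p2 == 1)).sum(0)
    n00v = ((p1 == 0) & (p2 == 0)).sum(0)
    out = np.eye(ndocs)
    out[c, r] = n11v / (nfeats - n00v)
    return out

nfeats, ndocs = 1024, 100
rng = np.random.default_rng(2)
rawdata = rng.integers(0, 2, size=(nfeats, ndocs))

# Both fill the diagonal and the upper triangle, so compare directly.
assert np.allclose(original_app(rawdata, ndocs, nfeats),
                   vectorized_app(rawdata, ndocs, nfeats))
```

In IPython, `%timeit` on each function with the same inputs reproduces the timing comparison.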
Some magic 300x+ speedup there!
So why is it so fast? Many factors are at play, the most important being that NumPy arrays are built for performance and optimized for vectorized computations. The proposed approach exploits that quite well, hence the acceleration.
Here is one related Q&A that details these performance criteria.