Calculating Jaccard similarities in Python

I have 20,000 documents for which I want to compute the true Jaccard similarity, so that I can later check how accurately MinWise hashing approximates it.

Each document is represented as a column of a numpy matrix, where each row is a word that either appears in the document (entry = 1) or does not (entry = 0). There are ~600 words (rows).

So, for example, column 1 might be [1 0 0 0 0 0 1 0 0 0 1 0], meaning that words 1, 7 and 11 appear in that document and the others do not.
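To make that concrete, here is a tiny sketch (made-up toy data, not my real matrix) of two such columns and the Jaccard similarity computed directly from them:

    import numpy as np

    # Toy word-by-document matrix: 5 words (rows) x 2 documents (columns).
    toy = np.array([[1, 0],
                    [1, 1],
                    [0, 0],
                    [0, 1],
                    [1, 1]])

    d1, d2 = toy[:, 0], toy[:, 1]
    both = ((d1 == 1) & (d2 == 1)).sum()    # words in both documents   -> 2
    either = ((d1 == 1) | (d2 == 1)).sum()  # words in at least one doc -> 4
    print(float(both) / either)             # Jaccard similarity = 0.5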

Is there a more efficient way to calculate the similarities than my elementary pairwise matching below? I don't see how I could use sets to speed things up, since the sets would just become {0, 1}; as it stands, the code is impossibly slow.

    import numpy as np

    # load file into python
    rawdata = np.loadtxt("myfile.csv", delimiter="\t")

    # convert the documents from rows to columns
    rawdata = np.transpose(rawdata)

    # compute true jaccard similarity
    ndocs = rawdata.shape[1]
    nwords = rawdata.shape[0]
    nfeats = nwords  # number of words/features
    tru_sim = np.zeros((ndocs, ndocs))

    # computes jaccard similarity of 2 documents
    def jaccard(c1, c2):
        n11 = sum((c1 == 1) & (c2 == 1))
        n00 = sum((c1 == 0) & (c2 == 0))
        jac = n11 / (nfeats - n00)
        return jac

    for i in range(0, ndocs):
        tru_sim[i, i] = 1
        for j in range(i + 1, ndocs):
            tru_sim[i, j] = jaccard(rawdata[:, i], rawdata[:, j])
2 answers

Here's a vectorized approach -

    # Get the row, col indices that are to be set in output array
    r, c = np.tril_indices(ndocs, -1)

    # Use those indices to slice out respective columns
    p1 = rawdata[:, c]
    p2 = rawdata[:, r]

    # Perform n11 and n00 vectorized computations across all indexed columns
    n11v = ((p1 == 1) & (p2 == 1)).sum(0)
    n00v = ((p1 == 0) & (p2 == 0)).sum(0)

    # Finally, setup output array and set final division computations
    out = np.eye(ndocs)
    out[c, r] = n11v / (nfeats - n00v)
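Note that out as computed above fills only the diagonal and one triangle. If the full symmetric matrix is wanted, one small sketch to complete it:

    # Mirror the filled triangle; subtract the identity so the diagonal
    # of ones is not counted twice.
    tru_sim = out + out.T - np.eye(ndocs)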

An alternative way to calculate n11v and n00v with np.einsum is

    n11v = np.einsum('ij,ij->j', (p1 == 1), (p2 == 1).astype(int))
    n00v = np.einsum('ij,ij->j', (p1 == 0), (p2 == 0).astype(int))

If rawdata consists of only 0s and 1s, an easier way to get them is

    n11v = np.einsum('ij,ij->j', p1, p2)
    n00v = np.einsum('ij,ij->j', 1 - p1, 1 - p2)
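A quick way to convince yourself that the einsum variants agree with the boolean-mask sums, sketched on small random 0/1 data:

    import numpy as np

    rng = np.random.RandomState(0)
    p1 = (rng.rand(6, 10) > 0.5).astype(int)   # small 0/1 test blocks
    p2 = (rng.rand(6, 10) > 0.5).astype(int)

    n11_mask = ((p1 == 1) & (p2 == 1)).sum(0)
    n11_einsum = np.einsum('ij,ij->j', p1, p2)

    n00_mask = ((p1 == 0) & (p2 == 0)).sum(0)
    n00_einsum = np.einsum('ij,ij->j', 1 - p1, 1 - p2)

    print(np.array_equal(n11_mask, n11_einsum))  # True
    print(np.array_equal(n00_mask, n00_einsum))  # True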

Benchmarking

Function Definitions -

    def original_app(rawdata, ndocs, nfeats):
        tru_sim = np.zeros((ndocs, ndocs))
        for i in range(0, ndocs):
            tru_sim[i, i] = 1
            for j in range(i + 1, ndocs):
                tru_sim[i, j] = jaccard(rawdata[:, i], rawdata[:, j])
        return tru_sim

    def vectorized_app(rawdata, ndocs, nfeats):
        r, c = np.tril_indices(ndocs, -1)
        p1 = rawdata[:, c]
        p2 = rawdata[:, r]
        n11v = ((p1 == 1) & (p2 == 1)).sum(0)
        n00v = ((p1 == 0) & (p2 == 0)).sum(0)
        out = np.eye(ndocs)
        out[c, r] = n11v / (nfeats - n00v)
        return out

Check and Timing -

    In [6]: # Setup inputs
       ...: rawdata = (np.random.rand(20,10000)>0.2).astype(int)
       ...: rawdata = np.transpose(rawdata)
       ...: ndocs = rawdata.shape[1]
       ...: nwords = rawdata.shape[0]
       ...: nfeats = 5
       ...:

    In [7]: # Verify results
       ...: out1 = original_app(rawdata, ndocs, nfeats)
       ...: out2 = vectorized_app(rawdata, ndocs, nfeats)
       ...: print np.allclose(out1,out2)
       ...:
    True

    In [8]: %timeit original_app(rawdata, ndocs, nfeats)
    1 loops, best of 3: 8.72 s per loop

    In [9]: %timeit vectorized_app(rawdata, ndocs, nfeats)
    10 loops, best of 3: 27.6 ms per loop

That's a 300x+ speedup right there!

So why is it so fast? Well, there are many factors at play, the most important being that NumPy arrays are built for performance and optimized for vectorized computations. The proposed approach exploits that quite well, hence the speedup.

There is a related Q&A that goes into more detail on these performance considerations.
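For reference, SciPy also ships a pairwise binary Jaccard distance that can produce the whole matrix in one call; a minimal sketch (assuming rawdata holds only 0s and 1s, and keeping in mind that SciPy's 'jaccard' metric is a distance, i.e. 1 - similarity, computed over rows):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # pdist compares rows, so feed it the documents as rows (transpose of rawdata).
    dist = pdist(rawdata.T.astype(bool), metric='jaccard')

    # squareform expands the condensed distance vector into a full ndocs x ndocs
    # matrix; its diagonal is 0, so 1 - squareform(dist) puts 1s on the diagonal.
    sim = 1.0 - squareform(dist)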


To calculate Jaccard, use:

    def jaccard(x, y):
        x = np.asarray(x, np.bool) # Not necessary, if you keep your data
        y = np.asarray(y, np.bool) # in a boolean array already!
        return np.double(np.bitwise_and(x, y).sum()) / np.double(np.bitwise_or(x, y).sum())

    print jaccard([1,1,0,0,0],[0,1,0,0,1])
    >>> 0.33333333333333331
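A usage sketch against the word-by-document matrix from the question (this assumes rawdata, ndocs and numpy as np are in scope as defined there):

    # Compare two document columns of the word-by-document matrix.
    sim_0_1 = jaccard(rawdata[:, 0], rawdata[:, 1])

    # Or fill the whole matrix with this helper (still a pairwise loop,
    # so it stays slow for 20,000 documents).
    tru_sim = np.eye(ndocs)
    for i in range(ndocs):
        for j in range(i + 1, ndocs):
            tru_sim[i, j] = tru_sim[j, i] = jaccard(rawdata[:, i], rawdata[:, j])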
