How to determine row equality across multiple rows in a fully vectorized way?

I have a numpy.ndarray called grouping of size (S, N). Each row of grouping gives me the group labels of the sample data. I run my algorithm S times and get new group labels at each iteration.

I want to determine how many times each sample of my data has the same group label as every other sample across the S iterations, in a fully vectorized way.

In a not fully vectorized form:

    sim_matrix = np.zeros((N, N))
    for s in range(S):
        sim_matrix += np.equal.outer(grouping[s, :], grouping[s, :])
3 answers

One vectorized approach would be with broadcasting -

 (grouping[:,None,:] == grouping[:,:,None]).sum(0) 

For performance, we can use np.count_nonzero -

 np.count_nonzero(grouping[:,None,:] == grouping[:,:,None],axis=0) 
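
A quick self-contained check that both broadcast forms match the loop from the question; the sizes and random labels here are made up for illustration:

    import numpy as np

    S, N = 10, 5                                # made-up sizes
    grouping = np.random.randint(0, 3, (S, N))  # random group labels

    # the original loop
    sim_loop = np.zeros((N, N))
    for s in range(S):
        sim_loop += np.equal.outer(grouping[s, :], grouping[s, :])

    # broadcast versions
    sim_bcast = (grouping[:, None, :] == grouping[:, :, None]).sum(0)
    sim_cnz = np.count_nonzero(grouping[:, None, :] == grouping[:, :, None], axis=0)

    assert np.array_equal(sim_loop, sim_bcast)
    assert np.array_equal(sim_loop, sim_cnz)

Note that both broadcast forms materialize an (S, N, N) boolean array, so memory use grows as S*N².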

IIUC, you want to find identical rows. The trick for that is to view each row as one raw block of bytes:

    S, N = 12, 2
    a = np.random.randint(0, 3, (S, N))   # 12 samples of two labels
    # a:
    #     0  1
    # 0   2  2
    # 1   2  0
    # 2   1  2
    # 3   0  0
    # 4   0  1
    # 5   1  1
    # 6   0  1
    # 7   0  1
    # 8   0  1
    # 9   0  0
    # 10  2  2
    # 11  0  0
    samples = np.ascontiguousarray(a).view(np.dtype((np.void, a.strides[0])))

samples.shape is then (S, 1).

Now you can catalogue your samples with np.unique and report with a pandas DataFrame:

    _, inds, invs = np.unique(samples, return_index=True, return_inverse=True)
    df = pd.DataFrame(invs)
    result = df.reset_index().groupby(0).index.apply(list).to_frame()
    result['sample'] = [list(x) for x in a[inds]]

giving:

               index  sample
    0
    0     [3, 9, 11]  [0, 0]
    1   [4, 6, 7, 8]  [0, 1]
    2            [5]  [1, 1]
    3            [2]  [1, 2]
    4            [1]  [2, 0]
    5        [0, 10]  [2, 2]

This can be O(S ln S) (the sort inside np.unique), if there are not many ties between the samples, versus your O(N²S).
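
As an aside, on NumPy 1.13+ np.unique accepts axis=0 directly, which gives the same row grouping without the contiguous void view; a minimal sketch under that assumption:

    import numpy as np
    import pandas as pd

    S, N = 12, 2
    a = np.random.randint(0, 3, (S, N))

    # unique rows directly; inds = first occurrences, invs = group id per row
    _, inds, invs = np.unique(a, axis=0, return_index=True, return_inverse=True)
    invs = invs.ravel()  # flatten, for robustness across NumPy versions

    result = pd.DataFrame(invs).reset_index().groupby(0)['index'].apply(list).to_frame()
    result['sample'] = [list(x) for x in a[inds]]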


Summing np.equal.outer over the rows is a roundabout way of computing the similarity of all pairs of columns:

 sum_i sum_jk (A[i,j] == A[i,k]) is the same as sum_jk sum_i (A[i,j] == A[i,k]) 

where sum_i runs over the rows and sum_jk over all pairs of columns. Comparing two vectors by counting the number of positions at which they differ is the Hamming distance. If we change == above to !=, so that distance = nrows - similarity (most similar ↔ distance 0), we get the problem: find the Hamming distances between all pairs of a bunch of vectors:

    def allpairs_hamming(A, dtype=np.uint32):
        """ -> Hamming distances between all pairs of rows of A """
        nrow, ncol = A.shape
        allpair_dist = np.zeros([nrow, nrow], dtype=dtype)
        for j in range(nrow):
            for k in range(j + 1, nrow):
                allpair_dist[j, k] = allpair_dist[k, j] = (A[j] != A[k]).sum()  # row diff
        return allpair_dist

    # allpairs_hamming: 30.7 sec, 3 ns per cmp  Nvec 2000  Veclen 5000  A 10m  pairdist uint32 15m

Nearly all the CPU time goes to the row-diff line, not to the outer for j ... for k loops; 3 ns per scalar comparison on a stock machine is not bad. However, memory caching is much faster when each row A[j] is in contiguous memory, as it is for C-order numpy arrays. Apart from that, whether you do "all pairs of rows" or "all pairs of columns" doesn't matter, as long as you are consistent.

(Is it possible to find "near" pairs in time and space < O(npairs), here O(20000²)? Afaik there are more methods than test cases.)
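
For a vectorized alternative to the double loop, scipy.spatial.distance.pdist can compute the same all-pairs Hamming distances; note that scipy's hamming is the mean mismatch fraction (see the note below), so scale by the number of columns to get counts. A sketch, with a made-up test shape:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    A = np.random.randint(0, 3, (2000, 5000))  # made-up Nvec x Veclen

    # pdist's 'hamming' returns the *fraction* of mismatching positions
    # per pair; multiply by ncol to get mismatch counts
    ncol = A.shape[1]
    allpair_dist = squareform(pdist(A, metric='hamming') * ncol).astype(np.uint32)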

See also:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html (bug: hamming is .mean, not .sum)
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://stats.stackexchange.com/search?q=[clustering]+pairwise

