Finding duplicate matrices in Python?

I have an array `a` with `a.shape == (80000, 38, 38)`. I want to check whether there are duplicate or similar `(38, 38)` matrices along the first dimension (80,000 of them in this case).

I could loop with two nested `for`s:

    for i in range(a.shape[0]):
        for g in range(a.shape[0]):
            if np.all(np.abs(a[i, :, :] - a[g, :, :]) < tolerance):
                pass  # save the index here

But that seems incredibly inefficient. I know there is numpy.unique, but I'm not sure I understand how it works when you have a set of two-dimensional matrices.

Any suggestions for an efficient way to do this? Is there a way to use broadcasting to compute the differences between all elements of all matrices at once?


Finding Exact Duplicate Blocks

Here's an approach based on lex-sorting -

    # Reshape a to 2D, as required in a few places later on
    ar = a.reshape(a.shape[0], -1)

    # Get lex-sorted indices
    sortidx = np.lexsort(ar.T)

    # Lex-sort the reshaped array to bring duplicate rows next to each other.
    # Then differentiate to find rows that have at least one non-zero
    # difference, as those represent unique rows and thus unique blocks
    # in axes (1,2) of the original 3D array
    out = a[sortidx][np.append(True, (np.diff(ar[sortidx], axis=0) != 0).any(1))]

Here's another approach that treats each block of elements along axes (1,2) as an indexing tuple, to decide whether that block is unique among the others -

    # Reshape a to 2D, as before
    ar = a.reshape(a.shape[0], -1)

    # Get the dimension shape, considering each block in axes (1,2)
    # as an indexing tuple
    dims = np.append(1, (ar[:, :-1].max(0) + 1).cumprod())

    # Finally, get the indices of the unique indexing tuples; these are
    # unique indices along the first axis for indexing into the input
    # array, giving the desired unique blocks along axes (1,2)
    out = a[np.unique(ar.dot(dims), return_index=True)[1]]

Run Example -

1] Input:

    In [151]: a
    Out[151]: 
    array([[[12,  4],
            [ 0,  1]],

           [[ 2,  4],
            [ 3,  2]],

           [[12,  4],
            [ 0,  1]],

           [[ 3,  4],
            [ 1,  3]],

           [[ 2,  4],
            [ 3,  2]],

           [[ 3,  0],
            [ 2,  1]]])

2] Output:

    In [152]: ar = a.reshape(a.shape[0],-1)
         ...: sortidx = np.lexsort(ar.T)

    In [153]: a[sortidx][np.append(True,(np.diff(ar[sortidx],axis=0)!=0).any(1))]
    Out[153]: 
    array([[[12,  4],
            [ 0,  1]],

           [[ 3,  0],
            [ 2,  1]],

           [[ 2,  4],
            [ 3,  2]],

           [[ 3,  4],
            [ 1,  3]]])

    In [154]: dims = np.append(1,(ar[:,:-1].max(0)+1).cumprod())

    In [155]: a[np.unique(ar.dot(dims),return_index=True)[1]]
    Out[155]: 
    array([[[12,  4],
            [ 0,  1]],

           [[ 3,  0],
            [ 2,  1]],

           [[ 2,  4],
            [ 3,  2]],

           [[ 3,  4],
            [ 1,  3]]])
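As a side note (not from the original answer): on NumPy 1.13 and newer, np.unique accepts an axis argument, so the reshaped rows can be deduplicated directly without the lex-sort or dot-product tricks. A minimal sketch on the same sample data:

```python
import numpy as np

a = np.array([[[12, 4], [ 0, 1]],
              [[ 2, 4], [ 3, 2]],
              [[12, 4], [ 0, 1]],
              [[ 3, 4], [ 1, 3]],
              [[ 2, 4], [ 3, 2]],
              [[ 3, 0], [ 2, 1]]])

# Collapse each (2, 2) block to a flat row, then let np.unique
# find the unique rows (NumPy >= 1.13)
ar = a.reshape(a.shape[0], -1)
_, idx = np.unique(ar, axis=0, return_index=True)

# Sorting the first-occurrence indices keeps the blocks in their
# original order of appearance
out = a[np.sort(idx)]
```

Note the ordering difference: this keeps blocks in order of first appearance, whereas the lex-sort approach returns them in lex-sorted order.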

Detecting Similar Blocks

For the similarity criterion, assuming you mean `(np.abs(a[i,:,:] - a[g,:,:]) < tolerance).all()`, here's a vectorized approach to get the indices of all similar blocks along axes (1,2) in the input array -

    R, C = np.triu_indices(a.shape[0], 1)
    mask = (np.abs(a[R] - a[C]) < tolerance).all(axis=(1, 2))
    I, G = R[mask], C[mask]

Run Example -

    In [267]: a
    Out[267]: 
    array([[[12,  4],
            [ 0,  1]],

           [[ 2,  4],
            [ 3,  2]],

           [[13,  4],
            [ 0,  1]],

           [[ 3,  4],
            [ 1,  3]],

           [[ 2,  4],
            [ 3,  2]],

           [[12,  5],
            [ 1,  1]]])

    In [268]: tolerance = 2

    In [269]: R,C = np.triu_indices(a.shape[0],1)
         ...: mask = (np.abs(a[R] - a[C]) < tolerance).all(axis=(1,2))
         ...: I,G = R[mask], C[mask]

    In [270]: I
    Out[270]: array([0, 0, 1, 2])

    In [271]: G
    Out[271]: array([2, 5, 4, 5])
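A caveat not covered in the original answer: with 80,000 blocks, np.triu_indices materializes roughly 3.2 billion pairs at once, which will not fit in memory. The same vectorized comparison can be processed in chunks of rows to keep peak memory bounded. This is a sketch using a hypothetical helper `similar_pairs` (name and `chunk` parameter are my own):

```python
import numpy as np

def similar_pairs(a, tolerance, chunk=256):
    """Return arrays (I, G) of index pairs whose blocks differ
    element-wise by less than `tolerance`, comparing in chunks."""
    n = a.shape[0]
    I, G = [], []
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Broadcast this chunk against all blocks:
        # shape (stop-start, n, 38, 38) instead of all n**2 pairs at once
        close = (np.abs(a[start:stop, None] - a[None, :]) < tolerance).all(axis=(2, 3))
        r, c = np.nonzero(close)
        r += start
        keep = c > r          # upper triangle only, so each pair appears once
        I.append(r[keep])
        G.append(c[keep])
    return np.concatenate(I), np.concatenate(G)
```

On the sample array above, `similar_pairs(a, 2)` reproduces the same `I` and `G` as the triu_indices version; the `chunk` size trades memory for the number of broadcast passes.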

Source: https://habr.com/ru/post/1241439/

