How to get unique columns and their occurrences for a 2D array?

I have a 2D array with several repeating columns. I would like to find out what the unique columns are and where the duplicates are.

My own array is too large to post here, but here is an example:

import numpy as np

a = np.array([[1., 0., 0., 0., 0.],
              [2., 0., 4., 3., 0.]])

Its unique column vectors are [1., 2.], [0., 0.], [0., 4.] and [0., 3.]. There is one duplicate: [0., 0.] appears twice.

Now I have found a way to get the unique vectors and their indices here, but it is not clear to me how to get the duplicates as well. I tried several naive approaches (with np.where and list comprehensions), but they are all very slow. Surely there should be a vectorized, numpythonic way?
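For illustration, one of those naive attempts looked roughly like this (a sketch, not my exact code):

# compare every column against all the others -- O(n^2) comparisons, hence very slow for large arrays
groups = [np.where((a == a[:, [j]]).all(axis=0))[0] for j in range(a.shape[1])]
# groups[1] -> array([1, 4]): the duplicated [0., 0.] column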

In MATLAB there is simply the unique function for this, but np.unique flattens arrays.

3 answers

Here's a vectorized approach that gives us a list of arrays as output -

# collapse each column to a scalar ID (assumes non-negative, integer-valued data)
ids = np.ravel_multi_index(a.astype(int), a.max(1).astype(int) + 1)
# sort so that identical columns end up next to each other
sidx = ids.argsort()
sorted_ids = ids[sidx]
# split the sorted column indices at every change of ID -> one array per group of identical columns
out = np.split(sidx, np.nonzero(sorted_ids[1:] > sorted_ids[:-1])[0] + 1)

Sample run -

In [62]: a
Out[62]: 
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 2.,  0.,  4.,  3.,  0.]])

In [63]: out
Out[63]: [array([1, 4]), array([3]), array([2]), array([0])]
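If the actual unique columns or the duplicate groups are needed, they can be read off out, for example -

unique_cols = a[:, [grp[0] for grp in out]]              # one representative column per group
duplicate_groups = [grp for grp in out if len(grp) > 1]  # here: [array([1, 4])], i.e. column [0., 0.]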

The numpy_indexed package (disclaimer: I am the author) contains efficient functionality for computing such things:

import numpy_indexed as npi
unique_columns = npi.unique(a, axis=1)
non_unique_column_idx = npi.multiplicity(a, axis=1) > 1

Or alternatively:

unique_columns, column_count = npi.count(a, axis=1)
duplicate_columns = unique_columns[:, column_count > 1]
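For example, with the array from the question (assuming numpy_indexed is installed):

import numpy as np
import numpy_indexed as npi

a = np.array([[1., 0., 0., 0., 0.],
              [2., 0., 4., 3., 0.]])

unique_columns, column_count = npi.count(a, axis=1)
# keep only the columns that occur more than once; for this array that is [0., 0.]
duplicate_columns = unique_columns[:, column_count > 1]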

For small arrays:

    from collections import defaultdict

    # group column indices by the column's values (as a hashable tuple)
    indices = defaultdict(list)
    for index, column in enumerate(a.transpose()):
        indices[tuple(column)].append(index)

    # columns seen exactly once vs. columns seen more than once
    unique = [kk for kk, vv in indices.items() if len(vv) == 1]
    non_unique = {kk: vv for kk, vv in indices.items() if len(vv) != 1}
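A quick check with the array from the question should give, roughly:

    print(unique)      # the three columns that occur once: (1., 2.), (0., 4.), (0., 3.)
    print(non_unique)  # {(0., 0.): [1, 4]} -- the duplicated column and its indices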

Source: https://habr.com/ru/post/1656958/

