Get a list of all indices of duplicate elements in a numpy array

I am trying to get the indices of all duplicate elements in a numpy array, but the solution I have found so far is really inefficient for a large (> 20,000 elements) input array (it takes roughly 9 seconds). The idea is simple:

  • records_array is a multidimensional array of timestamps from which we want to extract the indices of repeated timestamps

  • time_array is a numpy array containing all timestamps that are repeated in records_array

  • records is a Django QuerySet (which can easily be converted to a list) containing some Record objects. We want to create a list of pairs formed by all possible combinations of Record tagId attributes corresponding to the repeated timestamps found in records_array.

Here is the working (but inefficient) code that I have at the moment:

 import itertools
 import numpy as np

 tag_couples = []
 for t in time_array:
     users_inter = np.nonzero(records_array == t)[0]  # indices of all records whose timestamp equals t
     l = [str(records[i].tagId) for i in users_inter]  # temporary list of all tagIds recorded at time t
     if l.count(l[0]) != len(l):  # skip the case where a single tag accounts for every record at time t
         tag_couples += [x for x in itertools.combinations(list(set(l)), 2)]  # deduplicate with set(l) and append all possible couple combinations

I am sure that this can be optimized with numpy, but I cannot find a way to compare records_array against each element of time_array without a for loop (a simple == does not work, since they are both arrays).
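For reference, a minimal sketch of the kind of whole-array comparison the question is after; records_array and time_array below are small made-up examples, not the real data:

 import numpy as np

 records_array = np.array([10, 20, 30, 10, 10, 30, 40, 30, 20])  # hypothetical timestamps
 time_array = np.array([10, 30])                                  # hypothetical repeated timestamps

 # boolean mask of records whose timestamp appears in time_array, computed without a Python loop
 mask = np.isin(records_array, time_array)
 idx = np.nonzero(mask)[0]
 print(idx)  # [0 2 3 4 5 7]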

4 answers

A vectorized solution based, as usual, on the magic of numpy's unique(), without loops or list comprehensions:

 import numpy as np

 records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

 # indices that sort the array, and the array itself in sorted order
 idx_sort = np.argsort(records_array)
 sorted_records_array = records_array[idx_sort]

 # unique values, the index of the first occurrence of each value, and their counts
 vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)

 # sets of indices, one array per unique value
 res = np.split(idx_sort, idx_start[1:])

 # filter them with respect to their size, keeping only values occurring more than once
 vals = vals[count > 1]
 res = [r for r in res if r.size > 1]  # a list comprehension keeps this a list under Python 3
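As a usage note (my addition, not part of the original answer): after the filtering above, vals and res stay aligned, so zipping them shows which indices belong to which repeated value:

 duplicates = dict(zip(vals.tolist(), res))
 print(duplicates)  # {1: array([0, 3, 4]), 2: array([1, 8]), 3: array([2, 5, 7])}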

EDIT: my previous answer, below, requires a bit more memory, since it uses numpy broadcasting and calls unique twice:

 records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

 # unique values, the inverse mapping back to records_array, and the counts
 vals, inverse, count = np.unique(records_array, return_inverse=True, return_counts=True)

 # values that occur more than once
 idx_vals_repeated = np.where(count > 1)[0]
 vals_repeated = vals[idx_vals_repeated]

 # broadcast comparison: one row per repeated value, one column per matching index
 rows, cols = np.where(inverse == idx_vals_repeated[:, np.newaxis])
 _, inverse_rows = np.unique(rows, return_index=True)
 res = np.split(cols, inverse_rows[1:])

with the expected res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]


You can also do this:

 import numpy as np

 a = [1, 2, 3, 1, 1, 3, 4, 3, 2]
 index_sets = [np.argwhere(i == a) for i in np.unique(a)]

This gives you a list of arrays, each containing the indices of one unique element:

 [array([[0],[3],[4]], dtype=int64),
  array([[1],[8]], dtype=int64),
  array([[2],[5],[7]], dtype=int64),
  array([[6]], dtype=int64)]

Added: a further change to the list comprehension can also discard values that occur only once, and addresses the speed concern when there are many unique elements:

 new_index_sets = [np.argwhere(i[0] == a) for i in np.array(np.unique(a, return_counts=True)).T if i[1] >= 2]

this gives:

 [array([[0],[3],[4]], dtype=int64),
  array([[1],[8]], dtype=int64),
  array([[2],[5],[7]], dtype=int64)]
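Since np.argwhere returns column vectors, a small follow-up (my addition, not the answerer's code) flattens each group into a plain 1-D index array:

 flat_index_sets = [s.ravel() for s in new_index_sets]
 # [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]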

You could do something along the lines of:

 1. add the original index as a reference, giving pairs like [[1,0],[2,1],[3,2],[1,3],[1,4]...
 2. sort on [:,0]
 3. use np.where(ra[1:,0] != ra[:-1,0]) to find the break points
 4. use the list of indices from above to construct your final list of lists (see the sketch below)
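A minimal sketch of those four steps on a made-up 1-D array; this is my illustration of the outline, not the answerer's code:

 import numpy as np

 a = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])  # hypothetical input

 # 1. pair each value with its original index: [[1,0],[2,1],[3,2],...]
 ra = np.column_stack((a, np.arange(a.size)))

 # 2. sort on the value column (stable, so indices stay in ascending order within a group)
 ra = ra[np.argsort(ra[:, 0], kind='stable')]

 # 3. find the break points between runs of equal values
 breaks = np.where(ra[1:, 0] != ra[:-1, 0])[0] + 1

 # 4. split the original-index column at the break points and keep only real duplicates
 groups = np.split(ra[:, 1], breaks)
 duplicates = [g for g in groups if g.size > 1]
 print(duplicates)  # [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])]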

EDIT - OK, so after my quick reply I was away for a long while, and I see I have been voted down, which is fair, since numpy.argsort() is a much better way than my suggestion. I did vote up the numpy.unique() answer, as it is an interesting feature. However, if you use timeit, you will find that

 idx_start = np.where(sorted_records_array[:-1] != sorted_records_array[1:])[0] + 1
 res = np.split(idx_sort, idx_start)

is a little faster than

 vals, idx_start = np.unique(sorted_records_array, return_index=True)
 res = np.split(idx_sort, idx_start[1:])
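One way to check this yourself with the standard timeit module; the array size, value range, and repeat count below are arbitrary choices of mine, not from the answer:

 import timeit

 setup = """
 import numpy as np
 records_array = np.random.randint(0, 5000, size=20000)
 idx_sort = np.argsort(records_array)
 sorted_records_array = records_array[idx_sort]
 """

 where_variant = """
 idx_start = np.where(sorted_records_array[:-1] != sorted_records_array[1:])[0] + 1
 res = np.split(idx_sort, idx_start)
 """

 unique_variant = """
 vals, idx_start = np.unique(sorted_records_array, return_index=True)
 res = np.split(idx_sort, idx_start[1:])
 """

 print("where :", timeit.timeit(where_variant, setup, number=500))
 print("unique:", timeit.timeit(unique_variant, setup, number=500))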

Further edit, following the question from @Nicolas:

I'm not sure you can. It would be possible to get two arrays of indices marking the break points, but you cannot split different rows of the array into pieces of different sizes using np.split, so

 a = np.array([[4,27,42,12, 4 .. 240, 12], [3,65,23...] etc])
 idx = np.argsort(a, axis=1)
 sorted_a = np.diagonal(a[:, idx[:]]).T
 idx_start = np.where(sorted_a[:,:-1] != sorted_a[:,1:])

 # idx_start => (array([0,0,0,..1,1,..]), array([1,4,6,7..99,0,4,5]))

but it can be good enough, depending on what you want to do with the information.
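If per-row groups are actually needed, the fallback is a plain Python loop over rows. The sketch below is my own illustration, not part of the answer, and uses np.take_along_axis instead of the np.diagonal trick above; the 2-D array is a hypothetical example with the elided values filled in arbitrarily:

 import numpy as np

 a = np.array([[4, 27, 42, 12, 4, 240, 12],
               [3, 65, 23, 3, 65, 3, 23]])

 idx = np.argsort(a, axis=1, kind='stable')
 sorted_a = np.take_along_axis(a, idx, axis=1)

 groups = []
 for row in range(a.shape[0]):
     # per-row break points where the sorted value changes
     breaks = np.where(sorted_a[row, :-1] != sorted_a[row, 1:])[0] + 1
     groups.append([g for g in np.split(idx[row], breaks) if g.size > 1])

 print(groups)
 # [[array([0, 4]), array([3, 6])], [array([0, 3, 5]), array([2, 6]), array([1, 4])]]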


So I could not get rid of the for loop entirely, but I was able to keep its use to a minimum by combining the set data type and the list.count() method:

 data = [1, 2, 3, 1, 4, 5, 2, 2]
 indivs = set(data)

 multi_index = lambda lst, val: [i for i, x in enumerate(lst) if x == val]

 if data != list(indivs):
     dupes = [multi_index(data, i) for i in indivs if data.count(i) > 1]

Basically you loop over your set indivs, which contains the values (without duplicates), and then loop over the full list whenever you find an element that has a duplicate. I am looking into a numpy alternative if this is not fast enough for you. Generator objects could also speed this up if necessary.
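As an aside (my addition, not the answerer's code), a single pass with collections.defaultdict avoids the repeated list.count() scans over the full list:

 from collections import defaultdict

 data = [1, 2, 3, 1, 4, 5, 2, 2]

 # build value -> list of indices in a single pass
 positions = defaultdict(list)
 for i, x in enumerate(data):
     positions[x].append(i)

 dupes = [idx for idx in positions.values() if len(idx) > 1]
 print(dupes)  # [[0, 3], [1, 6, 7]]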

Edit: the answer given by gg349 contains the numpy solution I was working towards!


Source: https://habr.com/ru/post/986412/

