Removing duplicates from a list of NumPy arrays

I have a regular Python list containing (multidimensional) NumPy arrays, all with the same shape and the same number of values. Some of the arrays in the list are duplicates of earlier ones.

I want to remove all the duplicates, but the fact that the data type is a NumPy array complicates this a bit:

• I cannot use set() because NumPy arrays are not hashable.
• I cannot check for duplicates at insertion time, because the arrays are generated in batches by a function and added to the list with .extend().
• NumPy arrays are not directly comparable without resorting to one of NumPy's own functions, so I can't just rely on something like "if x in list" (see the snippet below).
• At the end of the process the contents of the list must still be NumPy arrays; I could compare copies converted to nested lists, but I can't keep converting the arrays back and forth to plain Python lists.
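For concreteness, roughly what goes wrong (a minimal sketch, assuming small 1-D integer arrays like the ones below):

    import numpy as np

    a = np.array([1, 2, 3, 4])
    b = np.array([1, 2, 3, 4])

    # set([a, b])   ->  TypeError: unhashable type: 'numpy.ndarray'
    # a in [b]      ->  ValueError: The truth value of an array with more than one
    #                   element is ambiguous. Use a.any() or a.all()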

Any suggestions for efficient duplicate removal here?

+2
3 answers

Using the solutions from "Most efficient property to hash for numpy array", we see that hashing works best with a.tostring() if a is a NumPy array. So:

    import numpy as np

    arraylist = [np.array([1,2,3,4]), np.array([1,2,3,4]), np.array([1,3,2,4])]
    L = {array.tostring(): array for array in arraylist}

    L.values()
    # [array([1, 3, 2, 4]), array([1, 2, 3, 4])]
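A caveat not in the original answer: in Python 3, dict.values() returns a view rather than a list, and ndarray.tostring() has since been deprecated in favour of the byte-identical ndarray.tobytes(). A sketch of the same idea under those assumptions:

    import numpy as np

    arraylist = [np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4])]

    # Keyed by the raw bytes of each array; the last array with a given byte string wins.
    unique = list({a.tobytes(): a for a in arraylist}.values())
    # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]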
+1

Depending on the structure of your data, it may be easier to compare all the arrays directly rather than finding some way to hash them. The algorithm is O(n^2), but each individual comparison will be much faster than creating strings or Python lists out of your arrays. So it depends on how many arrays you have to check.

For example:

    uniques = []
    for arr in possible_duplicates:
        if not any(numpy.array_equal(arr, unique_arr) for unique_arr in uniques):
            uniques.append(arr)
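A minimal, self-contained run of that loop (the sample data here is only illustrative, not from the original answer):

    import numpy as np

    possible_duplicates = [np.array([1, 2, 3, 4]),
                           np.array([1, 2, 3, 4]),
                           np.array([1, 3, 2, 4])]

    uniques = []
    for arr in possible_duplicates:
        # Keep arr only if it does not match any array we have already kept.
        if not any(np.array_equal(arr, unique_arr) for unique_arr in uniques):
            uniques.append(arr)

    print(uniques)  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]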
+3

Here is one way, using tuple:

    >>> import numpy as np
    >>> t = [np.asarray([1, 2, 3, 4]), np.asarray([1, 2, 3, 4]), np.asarray([1, 1, 3, 4])]
    >>> map(np.asarray, set(map(tuple, t)))
    [array([1, 1, 3, 4]), array([1, 2, 3, 4])]

If your arrays are multidimensional, first flatten each one into a 1-D tuple, then apply the same idea, and reshape them back at the end:

    def to_tuple(arr):
        return tuple(arr.reshape((arr.size,)))

    def from_tuple(tup, original_shape):
        return np.asarray(tup).reshape(original_shape)

Example:

    In [64]: t = np.asarray([[[1,2,3],[4,5,6]], [[1,1,3],[4,5,6]], [[1,2,3],[4,5,6]]])

    In [65]: map(lambda x: from_tuple(x, t[0].shape), set(map(to_tuple, t)))
    Out[65]:
    [array([[1, 2, 3],
            [4, 5, 6]]),
     array([[1, 1, 3],
            [4, 5, 6]])]
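Note that in Python 3, map() returns an iterator, so the one-liners above need to be materialized. A sketch of the same idea adapted for that, assuming (as in the question) that all arrays share one shape:

    import numpy as np

    t = np.asarray([[[1, 2, 3], [4, 5, 6]],
                    [[1, 1, 3], [4, 5, 6]],
                    [[1, 2, 3], [4, 5, 6]]])

    # Flatten each array into a hashable tuple, de-duplicate via a set,
    # then reshape each survivor back to the original shape.
    unique = [np.asarray(tup).reshape(t[0].shape)
              for tup in set(tuple(arr.ravel()) for arr in t)]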

Another option is to create a pandas.DataFrame from your ndarrays (reshaping them into rows if necessary) and use the pandas built-ins for de-duplicating rows.

    In [34]: t
    Out[34]: [array([1, 2, 3, 4]), array([1, 2, 3, 4]), array([1, 1, 3, 4])]

    In [35]: pandas.DataFrame(t).drop_duplicates().values
    Out[35]:
    array([[1, 2, 3, 4],
           [1, 1, 3, 4]])
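The DataFrame trick above applies directly only when each array is one-dimensional (each becomes one row). For multidimensional arrays, a possible variation (a sketch, assuming all arrays share one shape and dtype) is to flatten first and reshape afterwards:

    import numpy as np
    import pandas as pd

    arrays = [np.array([[1, 2, 3], [4, 5, 6]]),
              np.array([[1, 2, 3], [4, 5, 6]]),
              np.array([[1, 1, 3], [4, 5, 6]])]

    # One flattened array per DataFrame row; duplicate rows are dropped,
    # then each remaining row is reshaped back to the original shape.
    flat = pd.DataFrame([a.ravel() for a in arrays])
    unique = [row.reshape(arrays[0].shape) for row in flat.drop_duplicates().values]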

In general, trying to use tostring() as a quasi-hash function seems like a bad idea, because it needs more boilerplate code than my approach to guard against the possibility that some of the contents are mutated after they have been given their "hash key" in some dict.

If converting to tuple and back is too slow given the size of the data, to me that reveals a more fundamental problem: the application is not well designed around its needs (such as duplicate removal), and trying to squeeze them into a single in-memory Python process is probably not the right approach. At that point I would seriously consider whether something like Cassandra, which can easily build database indexes on top of large columns (or multidimensional arrays) of floating-point (or other) data, isn't a more sensible approach.

+2

Source: https://habr.com/ru/post/1270965/

