Here is one way, using tuple:

>>> import numpy as np
>>> t = [np.asarray([1, 2, 3, 4]), np.asarray([1, 2, 3, 4]), np.asarray([1, 1, 3, 4])]
>>> list(map(np.asarray, set(map(tuple, t))))
[array([1, 1, 3, 4]), array([1, 2, 3, 4])]
If your arrays are multidimensional, flatten each one to a tuple first, apply the same idea, and reshape the results back at the end:

def to_tuple(arr):
    return tuple(arr.reshape((arr.size,)))

def from_tuple(tup, original_shape):
    return np.asarray(tup).reshape(original_shape)
Example:
In [64]: t = np.asarray([[[1,2,3],[4,5,6]], [[1,1,3],[4,5,6]], [[1,2,3],[4,5,6]]])

In [65]: list(map(lambda x: from_tuple(x, t[0].shape), set(map(to_tuple, t))))
Out[65]:
[array([[1, 2, 3],
        [4, 5, 6]]),
 array([[1, 1, 3],
        [4, 5, 6]])]
Another option is to build a pandas.DataFrame from the ndarrays (reshaping them into rows if necessary) and use the built-in drop_duplicates to remove duplicate rows.
In [34]: t
Out[34]: [array([1, 2, 3, 4]), array([1, 2, 3, 4]), array([1, 1, 3, 4])]

In [35]: pandas.DataFrame(t).drop_duplicates().values
Out[35]:
array([[1, 2, 3, 4],
       [1, 1, 3, 4]])
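If the arrays are multidimensional, a minimal sketch along the same lines (assuming every array in t shares t[0].shape) flattens each one before building the DataFrame and reshapes the surviving rows afterwards:

import numpy as np
import pandas as pd

t = [np.array([[1, 2, 3], [4, 5, 6]]),
     np.array([[1, 1, 3], [4, 5, 6]]),
     np.array([[1, 2, 3], [4, 5, 6]])]

# Flatten each array into one DataFrame row, drop duplicate rows,
# then restore the original shape for each surviving row.
flat = pd.DataFrame([a.ravel() for a in t]).drop_duplicates().values
unique_arrays = [row.reshape(t[0].shape) for row in flat]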
In general, it seems like a bad idea to use tostring() as a quasi-hash key, because it takes more boilerplate than the approach above to guard against the possibility that an array's contents are mutated after it has been given its "hash key" in some dict.
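To illustrate the pitfall, here is a small sketch (tobytes() is the modern spelling of tostring(); the mutation is invented for the example):

import numpy as np

a = np.array([1, 2, 3, 4])
seen = {a.tobytes(): a}      # key the array by its raw bytes

a[0] = 99                    # mutate the array after it has been keyed

# The stored key still reflects the old contents, so a lookup by the
# current contents fails and the dict silently holds stale data.
print(a.tobytes() in seen)   # False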
If converting to tuple and back is too slow given the size of the data, that points to a more fundamental problem: the application's needs (e.g. de-duplication) are not well served by squeezing everything into a single in-memory Python process. At that point, I would consider whether something like Cassandra, which can easily build database indexes on top of large columns (or multidimensional arrays) of floating-point (or other) data, isn't a more sensible approach.