Fast removal of duplicates in NumPy and Python

Is there a quick way to get the unique elements of an array in NumPy? I have code like this (the relevant part is the last line):

    import numpy

    tab = numpy.arange(100000000)
    indices1 = numpy.random.permutation(10000)
    indices2 = indices1.copy()
    indices3 = indices1.copy()
    indices4 = indices1.copy()
    result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))

This is just an example; in my real situation indices1, indices2, ..., indices4 contain different sets of indices and have different sizes. The last line is executed many times, and I noticed that it is actually the bottleneck in my code ({numpy.core.multiarray.arange}, to be precise). Besides, the ordering does not matter, and the elements of the index arrays are of type int32. I thought about using a hash table with the element value as the key, and tried:

    import itertools

    seq = itertools.chain(tab[indices1].flatten(), tab[indices2].flatten(),
                          tab[indices3].flatten(), tab[indices4].flatten())
    myset = {}
    map(myset.__setitem__, seq, [])  # Python 2: map() runs the setitem calls eagerly
    result = numpy.array(myset.keys())

but it was even worse.

Is there any way to speed this up? I assume the performance limitation comes from fancy indexing, which copies the array, but I only need the result for reading (I do not modify anything).
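
For what it's worth, here is a small check (not part of my real code) that fancy indexing indeed copies, while a plain slice gives a view:

    import numpy

    tab = numpy.arange(10)

    view = tab[2:6]           # basic slice: a view that shares tab's memory
    copy = tab[[2, 3, 4]]     # fancy indexing: always returns a new copy

    view[0] = -1
    print(tab[2])             # -1 -- writing through the view changed tab
    print(copy.base is None)  # True -- the fancy-indexed result owns its data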

+6
2 answers

Sorry, I do not quite understand your question, but I will do my best to help.

First, {numpy.core.multiarray.arange} is numpy.arange, not fancy indexing; unfortunately, fancy indexing does not show up as a separate line item in the profiler. If you are calling np.arange inside a loop, you should see whether you can move it outside.

    In [27]: prun tab[tab]
    2 function calls in 1.551 CPU seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1    1.551    1.551    1.551    1.551  <string>:1(<module>)
         1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

    In [28]: prun numpy.arange(10000000)
    3 function calls in 0.051 CPU seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1    0.047    0.047    0.047    0.047  {numpy.core.multiarray.arange}
         1    0.003    0.003    0.051    0.051  <string>:1(<module>)
         1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
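
If the arange call does sit inside your loop, hoisting it out removes that cost entirely; a minimal sketch (the loop count and array size are made up):

    import numpy as np

    # slow: rebuilds the identical array on every iteration
    for _ in range(100):
        tab = np.arange(1000000)
        # ... use tab ...

    # fast: build the array once and reuse it
    tab = np.arange(1000000)
    for _ in range(100):
        pass  # ... use tab ...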

Second, I assume that tab is not really np.arange(a, b) in your code, because if it is, then tab[index] == index + a and you can skip the lookup altogether; but I assume that was just for your example.

Third, np.concatenate is about 10 times faster than np.array here:

    In [47]: timeit numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
    100 loops, best of 3: 5.11 ms per loop

    In [48]: timeit numpy.concatenate([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
    1000 loops, best of 3: 544 us per loop

(Also, np.concatenate gives a (4*n,) array while np.array gives a (4, n) array, where n is the length of indices1-4. The latter only works if indices1-4 all have the same length.)
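
A quick way to see the shape difference (a small illustration with made-up index arrays of equal length n = 3; note that np.unique flattens its input anyway, so both shapes yield the same unique elements):

    import numpy as np

    tab = np.arange(100)
    idx = np.array([1, 2, 3])

    stacked = np.array([tab[idx], tab[idx], tab[idx], tab[idx]])
    joined = np.concatenate([tab[idx], tab[idx], tab[idx], tab[idx]])

    print(stacked.shape)  # (4, 3) -- np.array stacks the four results as rows
    print(joined.shape)   # (12,)  -- np.concatenate joins them end to end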

Lastly, you can also save even more time if you can do the following:

    indices = np.unique(np.concatenate((indices1, indices2, indices3, indices4)))
    result = tab[indices]

Doing it in this order is faster because you reduce the number of lookups into tab, but it will only work if you know that the elements of tab are unique (otherwise you can get duplicates in the result even though the indices are unique).
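
To make the caveat concrete, here is a tiny made-up example where tab itself contains a duplicate, so unique indices no longer guarantee unique results:

    import numpy as np

    tab = np.array([7, 7, 8])                 # tab has a repeated element
    indices = np.unique(np.array([1, 0, 1]))  # unique indices: [0, 1]

    print(tab[indices])                       # [7 7] -- duplicates survive
    print(np.unique(tab[indices]))            # [7]   -- unique must come after the lookup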

Hope that helps

+3

[What follows is actually partially incorrect; see the PS:]

The following method of obtaining unique elements in all subarrays is very fast:

    seq = itertools.chain(tab[indices1].flat, tab[indices2].flat,
                          tab[indices3].flat, tab[indices4].flat)
    result = set(seq)

Note that flat (which returns an iterator) is used instead of flatten() (which returns a full array) and that set() can be called directly (instead of using map() and a dictionary, as in your second method).
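
The difference between the two is easy to inspect directly (a small sketch):

    import numpy as np

    a = np.arange(6).reshape(2, 3)

    it = a.flat        # numpy.flatiter: iterates lazily, no copy of the data
    arr = a.flatten()  # ndarray: allocates a new, flattened copy

    print(type(it))    # <class 'numpy.flatiter'>
    print(next(it))    # 0 -- it supports the iterator protocol directly
    print(arr)         # [0 1 2 3 4 5]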

Here are the timing results (obtained in the IPython shell):

    >>> %timeit result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))
    100 loops, best of 3: 8.04 ms per loop

    >>> seq = itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat)
    >>> %timeit set(seq)
    1000000 loops, best of 3: 223 ns per loop

In this example, the set / flat method is therefore 40 times faster.

PS: the timing of set(seq) above is actually not representative. The first timing loop empties the iterator seq, and subsequent evaluations of set() build an empty set! The correct timing test is the following:

    >>> %timeit set(itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat))
    100 loops, best of 3: 9.12 ms per loop

which shows that the set / flat method is actually not faster.
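
For reference, the exhaustion effect can be reproduced in plain Python, independently of NumPy (a small sketch):

    import itertools

    seq = itertools.chain([1, 2], [3, 4])
    print(set(seq))  # {1, 2, 3, 4} -- the first call consumes the iterator
    print(set(seq))  # set()        -- the chain is now exhausted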

PPS: here is an (unsuccessful) investigation of mtrw's suggestion; looking up the unique indices beforehand might be a good idea, but I cannot find a way to implement it that is faster than the approach above:

    >>> %timeit set(indices1).union(indices2).union(indices3).union(indices4)
    100 loops, best of 3: 11.9 ms per loop

    >>> %timeit set(itertools.chain(indices1.flat, indices2.flat, indices3.flat, indices4.flat))
    100 loops, best of 3: 10.8 ms per loop

Thus, finding the set of all distinct indices is itself rather slow.

PPPS: numpy.unique(<concatenated array of indices>) is actually 2-3 times faster than set(<concatenated array of indices>). This is the key to the speedup obtained in Bago's answer ( unique(concatenate((…))) ). The reason is probably that letting NumPy handle its arrays by itself is generally faster than interfacing pure Python ( set ) with NumPy arrays.
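
If you want to check this claim on your own data, a quick benchmark along these lines should do (the sizes are made up; it uses only the standard timeit module):

    import timeit

    import numpy as np

    data = np.random.randint(0, 10000, size=40000)  # stand-in for the concatenated indices

    t_unique = timeit.timeit(lambda: np.unique(data), number=100)
    t_set = timeit.timeit(lambda: set(data), number=100)

    print("np.unique: %.4f s, set: %.4f s" % (t_unique, t_set))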

Conclusion: this answer therefore mostly documents failed attempts that should not be followed, along with a possibly useful remark about timing code that uses iterators...

+4

Source: https://habr.com/ru/post/904412/

