I have a large list of integers (around a thousand elements), and I want to extract the first N (about 10-20) unique elements from it. Each integer occurs roughly three times in the list.
Writing an algorithm for this is trivial, but I wonder what the most speed- and memory-efficient way to do it is.
In my case, there are some additional restrictions and information:
In my use case, I extract the uniques from the array several times, each time skipping some number of elements from the beginning. The number of elements to skip is unknown at the time of unique extraction; I don't even have an upper bound for it. Therefore sorting the array up front doesn't help (I have to keep the order of the array).
The integers are all over the place, so a bit array as a lookup solution is out of the question.
I want to avoid memory allocations during the search at all costs.
My current solution looks something like this:
int num_uniques = 0;
int uniques[16];
int start_pos = 0;

while ((num_uniques != N) && (start_pos < array_length))
{
    int insert_position;
    int element = array[start_pos++];
    if (!binary_search (uniques, element, num_uniques, &insert_position))
    {
        insert_into_array (uniques, element, insert_position);
        num_uniques++;
    }
}
The binary_search / insert-into-array combination does the job, but performance is poor: each call to insert_into_array shifts a large number of elements around, and that slows everything down.
Any ideas?
EDIT
Great answers, guys! Everyone deserves an accepted answer, but I can only give one. I'll implement a bunch of your ideas and do a shootout with some typical data. Whoever's idea leads to the fastest implementation gets the accepted answer.
I'll run the code on a modern PC and on an embedded Cortex-A8 CPU, weigh the results somehow, and publish them as well.
EDIT: Shootout Results
Timings measured on a Core-Duo (100 … 160 …):
Bruteforce (Pete): 203 ticks
Hash and Bruteforce (Antti): 219 ticks
Inplace Binary Tree (Steven): 390 ticks
Binary-Search (Nils): 438 ticks
http://torus.untergrund.net/code/unique_search_shootout.zip (C source and test data)
Notes:
Inplace (…).
Binary-Search: 32 (…).