Extract the first N unique integers from an array

I have a large list of integers (thousands), and I want to extract the first N (about 10-20) unique elements from it. Each integer in the list occurs roughly three times.

Writing an algorithm for this is trivial, but I wonder what the most speed- and memory-efficient way to do it is.

In my case, there are some additional constraints and information:

  • In my use case, I extract my uniques several times from the array, each time skipping some elements from the beginning. The number of elements that I skip is not known during unique extraction; I don't even have an upper bound. Therefore, sorting is not speed efficient (I have to preserve the order of the array).

  • The integers are all over the place, so a bit array is not feasible as a lookup solution.

  • I want to avoid temporary allocations during the search at all costs.

My current solution looks something like this:

  int num_uniques = 0;
  int uniques[16];
  int startpos = 0;

  while ((num_uniques != N) && (startpos < array_length))
  {
    // A temporary used later.
    int insert_position;

    // Get the next element.
    int element = array[startpos++];

    // Check whether the element already exists. If it is not found,
    // binary_search returns the position where it could be inserted
    // while keeping the uniques array sorted.

    if (!binary_search (uniques, element, num_uniques, &insert_position))
    {

      // Insert the new unique element while preserving
      // the sorted order of the uniques array.

      insert_into_array (uniques, element, insert_position);

      num_uniques++;
    }
  }

The binary_search / insert_into_array approach does the job, but the performance is poor. The insert_into_array call moves elements around a lot, and this slows everything down.
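
The question does not show the two helpers, so here is a minimal sketch of what they might look like, to make the cost of the element shifting concrete. The bodies are assumptions, not the asker's code. In particular, this insert_into_array takes an extra `count` parameter (absent from the call above) because it has to know how many entries to shift.

```c
#include <string.h>

/* Returns 1 if `element` is among the first `count` entries of the sorted
   array `sorted`. Otherwise returns 0 and stores in *insert_position the
   index at which `element` would keep the array sorted. */
static int binary_search(const int *sorted, int element, int count,
                         int *insert_position)
{
    int lo = 0, hi = count;            /* search the half-open range [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] == element)
            return 1;
        if (sorted[mid] < element)
            lo = mid + 1;
        else
            hi = mid;
    }
    *insert_position = lo;
    return 0;
}

/* Shifts entries [position, count) up by one slot and stores `element`.
   ASSUMPTION: the `count` parameter is an addition of this sketch.
   The caller must make sure the array has room for count + 1 entries. */
static void insert_into_array(int *sorted, int count, int element, int position)
{
    memmove(&sorted[position + 1], &sorted[position],
            (size_t)(count - position) * sizeof sorted[0]);
    sorted[position] = element;
}
```

Every insertion shifts count - position entries, which is the element movement the question complains about. Note also that the resulting uniques array is ordered by value, not by first appearance.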

Any ideas?


EDIT

Great answers, guys! Everyone deserves an accepted answer, but I can only give one. I will implement a bunch of your ideas and do a shootout with some typical data. Whoever has the idea that leads to the fastest implementation gets the accepted answer.

I will run the code on a modern PC and an embedded Cortex-A8 CPU, and I will weigh the results somehow. I will also publish the results.


EDIT: Shootout Results

Core-Duo, 100 160 .

Bruteforce (Pete):            203 ticks
Hash and Bruteforce (Antti):  219 ticks
Inplace Binary Tree (Steven): 390 ticks
Binary-Search (Nils):         438 ticks

http://torus.untergrund.net/code/unique_search_shootout.zip ( C testdata)
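
For orientation, the fastest entry in the table above can be sketched as follows, assuming "Bruteforce" refers to a plain linear scan of the uniques found so far. The function name and signature are hypothetical and are not taken from the shootout archive.

```c
#include <stddef.h>

/* Copies up to max_uniques distinct values from array[start .. length)
   into out[], in order of first appearance, and returns how many were
   stored. No sorting and no allocation: each candidate is compared
   against the uniques collected so far. */
static size_t first_uniques_linear(const int *array, size_t length,
                                   size_t start, int *out, size_t max_uniques)
{
    size_t found = 0;
    for (size_t i = start; i < length && found < max_uniques; i++) {
        size_t k = 0;
        while (k < found && out[k] != array[i])
            k++;                       /* linear search in out[0 .. found) */
        if (k == found)
            out[found++] = array[i];   /* not seen before: keep it */
    }
    return found;
}
```

With only 10-20 uniques the scanned buffer is tiny, so the quadratic worst case stays small. The `start` parameter covers the requirement to skip an unknown number of leading elements on each call.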

:

  • Inplace ( ).

  • Binary-Search 32 . .

+3
8

( , 20 , 10, ), , .

+4

std:: set , N ? . , , () (), <.

+11

uniques . uniques, , . ( --- .) , , 16 .

, , . " ", , .

(EDIT: -, , . . lisp -, , . , , C .)

+4

, , O(n) O(1) . , ?

" ", .

+4

. .

+3

. 3N.

arr [i] =

arr [i + 1] =

arr [i + 2] =

"" k, k , [i + 1] [i + 2] . , .

.

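Example with N = 3 and the input 4 2 2 4 3 1 2 3: array size = 3 * 3 = 9.

"v" - value, "l" - index of the left child, "r" - index of the right child.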

 v  l  r  v  l  r  v  l  r
 _________________________
-1 -1 -1 -1 -1 -1 -1 -1 -1
 4 -1 -1 -1 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1  6  3 -1 -1

.

0 mod 3 - .
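
The table above can be reproduced with a sketch like the following. It assumes three int slots per node (value, left child index, right child index) and -1 as the "empty" marker, as in the table. The names tree_clear, tree_insert and EMPTY are hypothetical.

```c
#include <stddef.h>

#define EMPTY (-1)   /* marks an unused slot or a missing child link */

/* Fills a tree buffer of 3 * max_nodes ints with EMPTY. */
static void tree_clear(int *tree, size_t max_nodes)
{
    for (size_t i = 0; i < 3 * max_nodes; i++)
        tree[i] = EMPTY;
}

/* Inserts `value` unless it is already in the tree. *node_count is the
   number of nodes currently in use. Returns 1 if the value was new,
   0 if it was a duplicate. The caller must stop inserting once
   *node_count reaches the capacity of the buffer. */
static int tree_insert(int *tree, int *node_count, int value)
{
    int next = 3 * *node_count;        /* slot where a new node would go */
    if (*node_count == 0) {
        tree[0] = value;               /* first value becomes the root */
        *node_count = 1;
        return 1;
    }
    int i = 0;                         /* start at the root */
    for (;;) {
        if (value == tree[i])
            return 0;                  /* duplicate */
        int link = (value < tree[i]) ? i + 1 : i + 2;
        if (tree[link] == EMPTY) {
            tree[link] = next;         /* hook the new node into the tree */
            tree[next] = value;
            (*node_count)++;
            return 1;
        }
        i = tree[link];                /* descend to the child */
    }
}
```

Because new nodes are always appended, the uniques end up at indices 0, 3, 6, ... in order of first appearance. Only link slots are compared against EMPTY, so an input value of -1 is stored like any other value.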

, 4:

[i] =

[i + 1] =

[i + 2] =

[i + 3] =

+3

, , N , N (1 + e) ​​ e (, ).

, N uniques. K K/2 , (N ^ 2)/4 . N * (log (N) -1) . (N ^ 2)/4 + N (log (N) -1) + N (1 + e) ​​ .

, , . :

#include <stdint.h>
#include <string.h>

int num_uniques = 0, startpos = 0, k, element;
int uniques[16];

/* Allocate and clear a bit table of 32 * 32 = 1024 bits. */
uint32_t bit_table[32], hash;
memset(bit_table, 0, sizeof(bit_table));

while (num_uniques < N && startpos < array_length) {
  element = array[startpos++];

  /* Hash the element quickly to a number from 0..1023.
     The casts keep the shift well defined for negative elements. */
  hash = (uint32_t)element ^ ((uint32_t)element >> 16);
  hash *= 0x19191919;
  hash >>= 22;
  hash &= 1023;

  /* Map the hash value to a bit in the bit table.
     Use the low 5 bits of 'hash' to index bit_table
     and the other 5 bits to select the actual bit. */
  uint32_t slot = hash & 31;
  uint32_t bit = (1u << (hash >> 5));

  /* If the bit is NOT set, this element is guaranteed to be unique. */
  if (!(bit_table[slot] & bit)) {
    bit_table[slot] |= bit;
    uniques[num_uniques++] = element;
  } else { /* Otherwise it can still be unique with probability
              num_uniques / 1024, so scan the uniques found so far. */
    for (k = 0; k < num_uniques; k++) { if (uniques[k] == element) break; }
    if (k == num_uniques) uniques[num_uniques++] = element;
  }
}

N + N ^ 2/128, ( k) .

+2

N L

L , .

(1 ) (.. ) A.

L, L (i) A, Increment , .

. L , A (i). , A (i) > 2 .

, A.

, 2

00 count = 0
01 count = 1
10 count = 2
11 count > 2
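
The two-bit encoding in the table above could be stored four counters to a byte. The following is only a sketch: it assumes the counters are indexed directly by a small non-negative key, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads the 2-bit counter for slot `index` (four counters per byte).
   Returns 0, 1 or 2 for an exact count and 3 for "count > 2". */
static unsigned counter_get(const uint8_t *table, size_t index)
{
    return (table[index / 4] >> ((index % 4) * 2)) & 3u;
}

/* Increments the counter for slot `index`, saturating at 3 so that it
   never overflows into the neighbouring counter in the same byte. */
static void counter_increment(uint8_t *table, size_t index)
{
    unsigned shift = (unsigned)(index % 4) * 2;
    unsigned count = (table[index / 4] >> shift) & 3u;
    if (count < 3u)
        table[index / 4] = (uint8_t)(table[index / 4] + (1u << shift));
}
```

The table must be zeroed before use and needs one byte per four keys, so a key range of 0..1023 takes 256 bytes.
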
0

Source: https://habr.com/ru/post/1705094/

