I don't think you can do better than O(n) time complexity - in the worst case, you have to touch every element of the data set to find the duplicate.
One way to improve the space consumption (at the cost of some false positives from the filter, and a second pass over the data set) is to use a Bloom filter. The idea is to make a first pass over the data set: whenever you find a possible duplicate, remove it from the data set and add it to a hash table (if the Bloom filter is functioning correctly, only about 1% of the elements will be flagged as possible duplicates). Then make a second pass over the filtered data set, checking each element against the small hash table of possible duplicates.
My code will be in Java, as it is the language I am most familiar with.
    import java.util.HashSet;

    class DupFinder {
      BloomFilter filter = new BloomFilter(); // assumes a Bloom filter with add/contains; see note below
      HashSet<Integer> hashTable = new HashSet<>();
      int start = 0;

      Integer run(int[] dataset) {
        // first pass: flag possible duplicates and swap them to the front of the array
        for (int i = 0; i < dataset.length; i++) {
          if (filter.contains(dataset[i])) {
            // check if the element is in the hash table, else add it
            if (hashTable.contains(dataset[i])) {
              return dataset[i]; // duplicate found
            } else {
              hashTable.add(dataset[i]);
            }
            // remove the element from the filtered data set by swapping it into [0, start)
            int temp = dataset[start];
            dataset[start] = dataset[i];
            dataset[i] = temp;
            start++;
          } else {
            filter.add(dataset[i]);
          }
        }
        // second pass: check the filtered data set against the small hash table
        for (int i = start; i < dataset.length; i++) {
          if (hashTable.contains(dataset[i])) {
            return dataset[i]; // duplicate found
          }
        }
        return null; // no duplicate found
      }
    }
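The BloomFilter class above is a placeholder; I haven't tied it to a concrete implementation. As one option (my choice, not required by the approach), Guava's com.google.common.hash.BloomFilter provides the equivalent operations:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    // sized for the data set with a ~1% false-positive rate
    BloomFilter<Integer> filter =
        BloomFilter.create(Funnels.integerFunnel(), dataset.length, 0.01);
    filter.put(x);                            // plays the role of filter.add(x) above
    boolean flagged = filter.mightContain(x); // plays the role of filter.contains(x)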
An alternative to the hash table is Radix Sort, a linear-time sorting algorithm. Radix sort has better worst-case performance (O(n) for radix sort versus O(n^2) for a hash table, in the unlikely event that you encounter a ridiculous number of collisions), but worse average performance (a hash table implementation usually finds a duplicate after scanning only half of the data set, while radix sort always requires multiple passes over the data set). Radix sort will also be marginally more space-efficient if you use a space-efficient data structure for the buckets, e.g. a chunked list. A sketch of this alternative follows.
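Here is a minimal sketch of that alternative (the class and method names are my own): sort a copy of the array with an LSD radix sort over 8-bit digits, after which any duplicate must occupy adjacent positions.

    import java.util.Arrays;

    class RadixDupFinder {
      // LSD radix sort over four 8-bit digits; treats ints as unsigned for
      // ordering, which is fine here since equal values still end up adjacent
      static void radixSort(int[] a) {
        int[] buffer = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
          int[] count = new int[257];
          for (int x : a) count[((x >>> shift) & 0xFF) + 1]++;
          for (int i = 0; i < 256; i++) count[i + 1] += count[i]; // prefix sums: bucket start offsets
          for (int x : a) buffer[count[(x >>> shift) & 0xFF]++] = x; // stable scatter
          System.arraycopy(buffer, 0, a, 0, a.length);
        }
      }

      // sort, then scan: a duplicate pair must sit in adjacent positions
      static Integer findDuplicate(int[] dataset) {
        int[] copy = Arrays.copyOf(dataset, dataset.length);
        radixSort(copy);
        for (int i = 1; i < copy.length; i++) {
          if (copy[i] == copy[i - 1]) return copy[i];
        }
        return null; // no duplicate found
      }
    }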
You can also parallelize the hash table implementation without synchronization overhead. With t threads, each thread processes n/t elements of the data set (for example, with 32 elements and 2 threads, thread0 processes elements 0-15 and thread1 processes elements 16-31), placing each element into a bucket with index absoluteValue(x modulo t). Afterwards, each thread is responsible for processing all elements with a given bucket index, e.g. thread0 processes all buckets with index 0. I use a BlockingQueue for synchronization - this lets a thread call take() on the queue, which removes the first element of the queue or blocks until an element becomes available; all other data structures are thread-local. You will need to initialize the dupFinders variable so that each DupFinder instance appears at the same index in every thread's dupFinders array (e.g. thread0 always appears at index 0, thereby guaranteeing that all elements in its BlockingQueue have absoluteValue(x modulo t) == 0).
    import java.util.HashSet;
    import java.util.Stack;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Callable;
    import java.util.concurrent.LinkedBlockingQueue;

    class DupFinder implements Callable<Integer> {

      // fixed-size chunk: a space-efficient node for the per-bucket lists
      private static class Chunk {
        int size = 0;
        int[] chunk = new int[64];

        boolean add(int x) {
          if (size < 64) {
            chunk[size++] = x;
            return true;
          }
          return false;
        }
      }

      private final int t;                // number of threads
      private final int[] partialDataset; // this thread's n/t slice of the data set
      private final BlockingQueue<Stack<Chunk>> queue = new LinkedBlockingQueue<>();
      DupFinder[] dupFinders;             // shared across threads; index i is the owner of bucket i

      DupFinder(int t, int[] partialDataset) {
        this.t = t;
        this.partialDataset = partialDataset;
      }

      void add(Stack<Chunk> stack) {
        queue.add(stack);
      }

      public Integer call() throws InterruptedException {
        // partition this thread's elements by their bucket index absoluteValue(x modulo t)
        @SuppressWarnings("unchecked")
        Stack<Chunk>[] stacks = new Stack[t];
        for (int i = 0; i < t; i++) {
          stacks[i] = new Stack<>();
          stacks[i].push(new Chunk());
        }
        for (int i = 0; i < partialDataset.length; i++) {
          Stack<Chunk> tempStack = stacks[Math.abs(partialDataset[i] % t)];
          if (!tempStack.peek().add(partialDataset[i])) {
            Chunk chunk = new Chunk();
            chunk.add(partialDataset[i]);
            tempStack.push(chunk);
          }
        }
        // distribute the chunk stacks to the threads responsible for each bucket
        for (int i = 0; i < t; i++) {
          dupFinders[i].add(stacks[i]);
        }
        // drain one stack from each of the t threads; every element here shares this bucket index
        HashSet<Integer> hashTable = new HashSet<>();
        for (int i = 0; i < t; i++) {
          Stack<Chunk> tempStack = queue.take(); // blocks until a stack is available
          while (!tempStack.isEmpty()) {
            Chunk tempChunk = tempStack.pop();
            for (int j = 0; j < tempChunk.size; j++) {
              if (hashTable.contains(tempChunk.chunk[j])) {
                return tempChunk.chunk[j]; // duplicate found
              } else {
                hashTable.add(tempChunk.chunk[j]);
              }
            }
          }
        }
        return null; // no duplicate found in this thread's buckets
      }
    }
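For completeness, here is one way the threads might be wired up (the slicing logic and ExecutorService scaffolding are my assumptions, not part of the approach itself): give each thread its n/t slice plus a shared view of the dupFinders array, then return the first non-null result.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class ParallelDriver {
      static Integer findDuplicate(int[] dataset, int t) throws Exception {
        // slice the data set so each thread gets roughly n/t elements
        DupFinder[] dupFinders = new DupFinder[t];
        int sliceSize = (dataset.length + t - 1) / t;
        for (int i = 0; i < t; i++) {
          int from = Math.min(i * sliceSize, dataset.length);
          int to = Math.min(from + sliceSize, dataset.length);
          dupFinders[i] = new DupFinder(t, Arrays.copyOfRange(dataset, from, to));
        }
        // every thread shares the same array, so thread i always sits at index i
        for (DupFinder finder : dupFinders) {
          finder.dupFinders = dupFinders;
        }
        ExecutorService pool = Executors.newFixedThreadPool(t);
        try {
          List<Future<Integer>> results = new ArrayList<>();
          for (DupFinder finder : dupFinders) {
            results.add(pool.submit(finder));
          }
          for (Future<Integer> result : results) {
            Integer dup = result.get(); // null means that thread saw no duplicate
            if (dup != null) return dup;
          }
          return null;
        } finally {
          pool.shutdown();
        }
      }
    }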