I don't think you can do better than O(n) time complexity - in the worst case, you have to touch every element of the data set to find the duplicate.
One way to improve the space consumption (at the cost of some false positives from the filter, and a second pass over the data set) is to use a Bloom filter. The idea is to make a first pass over the data set: whenever you find a possible duplicate, remove it from the data set and add it to a hash table (if the Bloom filter is functioning correctly, only about 1% of the elements will be flagged as possible duplicates). Then make a second pass over the filtered data set, checking each element against the small hash table of possible duplicates.
My code will be in Java, as it is the language I am most familiar with.
    import java.util.HashSet;

    class DupFinder {
      BloomFilter filter = new BloomFilter(); // assumes a Bloom filter with add/contains; see note below
      HashSet<Integer> hashTable = new HashSet<>();
      int start = 0;

      Integer run(int[] dataset) {
        // first pass: flag possible duplicates and swap them to the front of the array
        for (int i = 0; i < dataset.length; i++) {
          if (filter.contains(dataset[i])) {
            // check if the element is in the hash table, else add it
            if (hashTable.contains(dataset[i])) {
              return dataset[i]; // duplicate found
            } else {
              hashTable.add(dataset[i]);
            }
            // remove the element from the filtered data set by swapping it into [0, start)
            int temp = dataset[start];
            dataset[start] = dataset[i];
            dataset[i] = temp;
            start++;
          } else {
            filter.add(dataset[i]);
          }
        }
        // second pass: check the filtered data set against the small hash table
        for (int i = start; i < dataset.length; i++) {
          if (hashTable.contains(dataset[i])) {
            return dataset[i]; // duplicate found
          }
        }
        return null; // no duplicate found
      }
    }
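The BloomFilter class above is a placeholder; I haven't tied it to a concrete implementation. As one option (my choice, not required by the approach), Guava's com.google.common.hash.BloomFilter provides the equivalent operations:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    // sized for the data set with a ~1% false-positive rate
    BloomFilter<Integer> filter =
        BloomFilter.create(Funnels.integerFunnel(), dataset.length, 0.01);
    filter.put(x);                            // plays the role of filter.add(x) above
    boolean flagged = filter.mightContain(x); // plays the role of filter.contains(x)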
An alternative to the hash table is Radix Sort, a linear-time sorting algorithm. Radix sort has better worst-case performance (O(n) for radix sort versus O(n^2) for a hash table, in the unlikely event that you encounter a ridiculous number of collisions), but worse average performance (a hash table implementation usually finds a duplicate after scanning only half of the data set, while radix sort always requires multiple passes over the data set). Radix sort will also be marginally more space-efficient if you use a space-efficient data structure for the buckets, e.g. a chunked list. A sketch of this alternative follows.
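Here is a minimal sketch of that alternative (the class and method names are my own): sort a copy of the array with an LSD radix sort over 8-bit digits, after which any duplicate must occupy adjacent positions.

    import java.util.Arrays;

    class RadixDupFinder {
      // LSD radix sort over four 8-bit digits; treats ints as unsigned for
      // ordering, which is fine here since equal values still end up adjacent
      static void radixSort(int[] a) {
        int[] buffer = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
          int[] count = new int[257];
          for (int x : a) count[((x >>> shift) & 0xFF) + 1]++;
          for (int i = 0; i < 256; i++) count[i + 1] += count[i]; // prefix sums: bucket start offsets
          for (int x : a) buffer[count[(x >>> shift) & 0xFF]++] = x; // stable scatter
          System.arraycopy(buffer, 0, a, 0, a.length);
        }
      }

      // sort, then scan: a duplicate pair must sit in adjacent positions
      static Integer findDuplicate(int[] dataset) {
        int[] copy = Arrays.copyOf(dataset, dataset.length);
        radixSort(copy);
        for (int i = 1; i < copy.length; i++) {
          if (copy[i] == copy[i - 1]) return copy[i];
        }
        return null; // no duplicate found
      }
    }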
You can also parallelize the hash table implementation without synchronization overhead. With t threads, each thread processes n/t elements of the data set (for example, with 32 elements and 2 threads, thread0 processes elements 0-15 and thread1 processes elements 16-31), placing each element into a bucket with index absoluteValue(x modulo t). Afterwards, each thread is responsible for processing all elements with a given bucket index, e.g. thread0 processes all buckets with index 0. I use a BlockingQueue for synchronization - this lets a thread call take() on the queue, which removes the first element of the queue or blocks until an element becomes available; all other data structures are thread-local. You will need to initialize the dupFinders variable so that each DupFinder instance appears at the same index in every thread's dupFinders array (e.g. thread0 always appears at index 0, thereby guaranteeing that all elements in its BlockingQueue have absoluteValue(x modulo t) == 0).
    import java.util.HashSet;
    import java.util.Stack;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Callable;
    import java.util.concurrent.LinkedBlockingQueue;

    class DupFinder implements Callable<Integer> {

      // fixed-size chunk: a space-efficient node for the per-bucket lists
      private static class Chunk {
        int size = 0;
        int[] chunk = new int[64];

        boolean add(int x) {
          if (size < 64) {
            chunk[size++] = x;
            return true;
          }
          return false;
        }
      }

      private final int t;                // number of threads
      private final int[] partialDataset; // this thread's n/t slice of the data set
      private final BlockingQueue<Stack<Chunk>> queue = new LinkedBlockingQueue<>();
      DupFinder[] dupFinders;             // shared across threads; index i is the owner of bucket i

      DupFinder(int t, int[] partialDataset) {
        this.t = t;
        this.partialDataset = partialDataset;
      }

      void add(Stack<Chunk> stack) {
        queue.add(stack);
      }

      public Integer call() throws InterruptedException {
        // partition this thread's elements by their bucket index absoluteValue(x modulo t)
        @SuppressWarnings("unchecked")
        Stack<Chunk>[] stacks = new Stack[t];
        for (int i = 0; i < t; i++) {
          stacks[i] = new Stack<>();
          stacks[i].push(new Chunk());
        }
        for (int i = 0; i < partialDataset.length; i++) {
          Stack<Chunk> tempStack = stacks[Math.abs(partialDataset[i] % t)];
          if (!tempStack.peek().add(partialDataset[i])) {
            Chunk chunk = new Chunk();
            chunk.add(partialDataset[i]);
            tempStack.push(chunk);
          }
        }
        // distribute the chunk stacks to the threads responsible for each bucket
        for (int i = 0; i < t; i++) {
          dupFinders[i].add(stacks[i]);
        }
        // drain one stack from each of the t threads; every element here shares this bucket index
        HashSet<Integer> hashTable = new HashSet<>();
        for (int i = 0; i < t; i++) {
          Stack<Chunk> tempStack = queue.take(); // blocks until a stack is available
          while (!tempStack.isEmpty()) {
            Chunk tempChunk = tempStack.pop();
            for (int j = 0; j < tempChunk.size; j++) {
              if (hashTable.contains(tempChunk.chunk[j])) {
                return tempChunk.chunk[j]; // duplicate found
              } else {
                hashTable.add(tempChunk.chunk[j]);
              }
            }
          }
        }
        return null; // no duplicate found in this thread's buckets
      }
    }
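For completeness, here is one way the threads might be wired up (the slicing logic and ExecutorService scaffolding are my assumptions, not part of the approach itself): give each thread its n/t slice plus a shared view of the dupFinders array, then return the first non-null result.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class ParallelDriver {
      static Integer findDuplicate(int[] dataset, int t) throws Exception {
        // slice the data set so each thread gets roughly n/t elements
        DupFinder[] dupFinders = new DupFinder[t];
        int sliceSize = (dataset.length + t - 1) / t;
        for (int i = 0; i < t; i++) {
          int from = Math.min(i * sliceSize, dataset.length);
          int to = Math.min(from + sliceSize, dataset.length);
          dupFinders[i] = new DupFinder(t, Arrays.copyOfRange(dataset, from, to));
        }
        // every thread shares the same array, so thread i always sits at index i
        for (DupFinder finder : dupFinders) {
          finder.dupFinders = dupFinders;
        }
        ExecutorService pool = Executors.newFixedThreadPool(t);
        try {
          List<Future<Integer>> results = new ArrayList<>();
          for (DupFinder finder : dupFinders) {
            results.add(pool.submit(finder));
          }
          for (Future<Integer> result : results) {
            Integer dup = result.get(); // null means that thread saw no duplicate
            if (dup != null) return dup;
          }
          return null;
        } finally {
          pool.shutdown();
        }
      }
    }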