Suppose I have 1 GB of memory. How do I find duplicates in a set of URLs that is far too large to fit in that memory?
I saw a solution in Cracking the Coding Interview that suggests using a hash function to split the URLs across 4000 files in a first scan: each URL u is written to the file x.txt, where x = hash(u) % 4000. In a second scan, we can then check for duplicates within each x.txt file separately.
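To make sure I understand the book's approach, here is a minimal sketch of the two-scan idea, assuming the input is a file urls.txt with one URL per line. The file names, the buffered writes, and the use of md5 are my own additions, not from the book:

```python
import hashlib
import os
from collections import defaultdict

NUM_BUCKETS = 4000

def bucket_of(url: str) -> int:
    # Deterministic hash so every copy of a URL lands in the same bucket file.
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_BUCKETS

def flush(buffers: dict, out_dir: str) -> None:
    # Append buffered URLs to their bucket files. Buffering avoids keeping
    # 4000 file handles open at once, which can exceed the OS descriptor limit.
    for b, urls in buffers.items():
        with open(os.path.join(out_dir, f"{b}.txt"), "a") as out:
            out.write("\n".join(urls) + "\n")
    buffers.clear()

def shard(input_path: str, out_dir: str, flush_every: int = 100_000) -> None:
    # First scan: split the input into NUM_BUCKETS files by hash of the URL.
    os.makedirs(out_dir, exist_ok=True)
    buffers, pending = defaultdict(list), 0
    with open(input_path) as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            buffers[bucket_of(url)].append(url)
            pending += 1
            if pending >= flush_every:
                flush(buffers, out_dir)
                pending = 0
    flush(buffers, out_dir)

def report_duplicates(out_dir: str) -> None:
    # Second scan: each bucket file is checked on its own with an in-memory set.
    for i in range(NUM_BUCKETS):
        path = os.path.join(out_dir, f"{i}.txt")
        if not os.path.exists(path):
            continue
        seen, dups = set(), set()
        with open(path) as f:
            for line in f:
                url = line.strip()
                if url in seen:
                    dups.add(url)
                seen.add(url)
        for url in dups:
            print(url)

if __name__ == "__main__":
    shard("urls.txt", "buckets")
    report_duplicates("buckets")
```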
But how can I guarantee that each file will hold only about 1 GB of URL data? It seems likely that some files will end up with much more URL data than others.
My solution to this problem is to apply the file-splitting trick iteratively, until every file is small enough to fit in the memory available to me.
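Concretely, I mean something like the following recursive re-split: any bucket that is still over the memory budget is split again with a salted hash so the URLs spread differently. The budget, fanout, and depth cap are assumptions of mine, not from the book:

```python
import hashlib
import os

MEMORY_BUDGET = 1 << 30  # roughly the 1 GB limit from the question
FANOUT = 16              # how many pieces to split an oversized bucket into

def salted_bucket(url: str, salt: int) -> int:
    # A different salt gives a different distribution on each re-split.
    return int(hashlib.md5(f"{salt}:{url}".encode()).hexdigest(), 16) % FANOUT

def split_until_small(path: str, salt: int = 0, depth: int = 0) -> list:
    # Returns a list of files that are each small enough to dedupe in memory.
    # Depth cap: a bucket made of many copies of a single URL can never be
    # split smaller, but its set of distinct URLs is tiny, so deduping it
    # directly is still fine.
    if os.path.getsize(path) <= MEMORY_BUDGET or depth >= 4:
        return [path]
    parts = [open(f"{path}.{salt}.{i}", "w") for i in range(FANOUT)]
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url:
                parts[salted_bucket(url, salt)].write(url + "\n")
    for p in parts:
        p.close()
    small = []
    for i in range(FANOUT):
        small.extend(split_until_small(f"{path}.{salt}.{i}", salt + 1, depth + 1))
    return small
```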
Is there any other way to do this?