Given 10 billion URLs with an average length of 100 characters each, check for duplicates

Suppose I have 1 GB of memory. How do I find duplicates among these URLs?

I saw one solution in Cracking the Coding Interview that suggests splitting the URLs into 4000 files x.txt, where x = hash(u) % 4000, during a first scan. Then, in a second scan, we can check for duplicates within each x.txt file separately.
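In code, my understanding of that two-pass scheme is roughly the following (Python; the input path, the output file names, and the choice of MD5 for bucketing are just placeholders I picked):

    import hashlib
    import os

    N_BUCKETS = 4000  # as in the book's solution

    def bucket_of(url):
        # Python's built-in hash() is salted per process, so a stable digest
        # (MD5 here, purely as a placeholder) is used for bucketing instead.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % N_BUCKETS

    def partition(input_path, out_dir):
        # First scan: append every URL to the file chosen by its bucket.
        # (Real code would cap the number of simultaneously open handles.)
        os.makedirs(out_dir, exist_ok=True)
        handles = {}
        with open(input_path, encoding="utf-8") as src:
            for line in src:
                url = line.rstrip("\n")
                b = bucket_of(url)
                if b not in handles:
                    handles[b] = open(os.path.join(out_dir, "%d.txt" % b),
                                      "a", encoding="utf-8")
                handles[b].write(url + "\n")
        for h in handles.values():
            h.close()

    def duplicates_in_bucket(path):
        # Second scan: each bucket file should now fit in memory on its own.
        seen, dups = set(), set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                url = line.rstrip("\n")
                if url in seen:
                    dups.add(url)
                else:
                    seen.add(url)
        return dups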

But how can I guarantee that each file will store about 1 GB of URL data? I think it is likely that some files will store much more URL data than others.

My solution to this problem is to apply the file-splitting trick iteratively until the files are small enough for the memory available to me.

Is there any other way to do this?

+4
3 answers

If you don't mind a solution that requires a bit more code, you can do the following:

  • Compute only the hash codes. Each hash code is exactly 4 bytes, so you have precise control over how much memory each chunk of hash codes occupies. You can also fit far more hash codes in memory than URLs, so you will have fewer chunks.

  • Find the duplicate hash codes. Presumably there will be far fewer of them than 10 billion; they might even all fit in memory.

  • Go through the URLs again, recomputing the hash codes; when a URL has one of the duplicate hash codes, compare the actual URLs to rule out false positives caused by hash collisions. (With 10 billion URLs and only about 4 billion distinct 4-byte hash values, there will be plenty of collisions.) A rough Python sketch of all three steps is below.
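(In the sketch, CRC32 stands in for "any 4-byte hash", and the hash counts are kept in a single in-memory Counter, glossing over the chunking mentioned in the first step.)

    import zlib
    from collections import Counter, defaultdict

    def hash32(url):
        # Any well-mixed 4-byte hash will do; CRC32 is used only as an example.
        return zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF

    def find_duplicate_urls(input_path):
        # Steps 1 and 2: count hash codes, keep only those seen more than once.
        counts = Counter()
        with open(input_path, encoding="utf-8") as f:
            for line in f:
                counts[hash32(line.rstrip("\n"))] += 1
        suspect_hashes = {h for h, c in counts.items() if c > 1}
        del counts

        # Step 3: re-scan; for suspect hash codes compare the actual URLs
        # to weed out false positives caused by collisions.
        seen_by_hash = defaultdict(set)
        duplicates = set()
        with open(input_path, encoding="utf-8") as f:
            for line in f:
                url = line.rstrip("\n")
                h = hash32(url)
                if h in suspect_hashes:
                    if url in seen_by_hash[h]:
                        duplicates.add(url)
                    else:
                        seen_by_hash[h].add(url)
        return duplicates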

+3

Don't worry about the split being perfectly even.

Do the arithmetic first: 10 billion URLs at about 100 bytes each is roughly 1 TB of data. Split into 1000 files, that would be about 1 GB per file; split into 4000 files, roughly 250 MB per file on average.

A decent hash function spreads the URLs close to uniformly, so a bucket file coming out dramatically larger than average is unlikely. If one does, your iterative idea is fine: re-hash just that file with a different hash function and split it again.

The one case hashing cannot smooth out is heavy duplication of a single URL: if the same 100-character URL occurs 10,000,000 times, that alone is about 1 GB, and every copy lands in the same bucket file. But such a "degenerate" file is mostly copies of one URL, which is exactly the duplicate you are looking for, so it is easy to handle.
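If it helps, the arithmetic and the "re-split one oversized file" step could look roughly like this (the 16-way fan-out and the salted MD5 are arbitrary placeholders):

    import hashlib
    import os

    TOTAL_URLS = 10_000_000_000
    AVG_URL_BYTES = 100
    N_BUCKETS = 4000

    # ~1 TB of data over 4000 buckets is ~250 MB per bucket on average.
    print(TOTAL_URLS * AVG_URL_BYTES // N_BUCKETS)  # 250000000

    def resplit(path, out_dir, salt="1", fanout=16):
        # If one bucket file still comes out too large, split it again with a
        # different (salted) hash; copies of the same URL still land together.
        os.makedirs(out_dir, exist_ok=True)
        outs = [open(os.path.join(out_dir, "part%02d.txt" % i), "a", encoding="utf-8")
                for i in range(fanout)]
        with open(path, encoding="utf-8") as f:
            for line in f:
                url = line.rstrip("\n")
                digest = hashlib.md5((salt + url).encode("utf-8")).digest()
                outs[int.from_bytes(digest[:8], "big") % fanout].write(url + "\n")
        for h in outs:
            h.close()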

+3

Hashing is not the only way to partition the data. You can bucket by prefix instead: 10G URLs are far more than 1 GB of memory will hold, so the data has to be spread across files either way.

In a first pass, gather statistics on how the URLs are distributed over their leading characters.

In a second pass, write each URL to a file chosen by a few of its leading characters (say characters 2..6), naming each file after its prefix, e.g. 01 (or 000001), and so on.

Each prefix file can then be checked for duplicates on its own.

But even this approach requires several passes to gather the various statistics. Alternatively, you can build a Patricia tree with a histogram count in each leaf. To save memory, some tokenizing / an online dictionary can be added. When the 1 GB limit is reached, the largest entries are evicted from the tree and smaller ones are added instead. The next pass then continues with the last URL left in the tree as the starting point.
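A much-simplified sketch of that last idea: a plain dict of counts stands in for the Patricia tree, and an entry cap stands in for the 1 GB byte limit (both simplifications are mine):

    from collections import Counter

    def count_with_eviction(url_iter, max_entries=10_000_000):
        # Simplified stand-in for the bounded tree: URL -> count, evicting the
        # largest entries when the (approximated) memory budget is reached.
        counts = Counter()
        known_dups = set()
        for url in url_iter:
            counts[url] += 1
            if len(counts) > max_entries:
                # Evict the biggest entries to make room for smaller ones.
                # Anything evicted with count > 1 is already a known duplicate;
                # anything evicted with count == 1 has to wait for a later pass,
                # as described above.
                for victim, c in counts.most_common(max_entries // 10):
                    if c > 1:
                        known_dups.add(victim)
                    del counts[victim]
        known_dups.update(u for u, c in counts.items() if c > 1)
        return known_dups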

0
