What algorithm is used to remove duplicates?

Imagine we have a file called, say, "A.txt". We know it contains some repeated elements. "A.txt" is very large, more than ten times larger than memory, perhaps around 50 GB. Sometimes the output file B will be roughly the same size as A, and sometimes it will be many times smaller. Suppose A has a structure like this:

a 1 b 2 c 445 a 1 

We need to get a "B.txt" file that will not have such duplicates. For example, it should be like this:

a 1 b 2 c 445

I was thinking of an algorithm that copies A into B, then takes the first line of B and compares it against every other line; if it finds the same line, it removes the duplicate. Then it takes the second line, and so on.

But I think this method is too slow. What can I use instead?

A is not a database! No SQL, please.

Sorry, I forgot to say: sorting is OK.

Although it can be sorted here, what if it could not be sorted?

+4
3 answers

Suppose you can fit 1/k'th of the file into memory and still have room for working data structures. The whole file can be processed in k or fewer passes, as described below, and this can be much faster than sorting the entire file, depending on line lengths and the constants of the sorting algorithm. Sorting averages O(n ln n), while the process below is O(kn) in the worst case. For example, if lines average 10 characters and there are n = 5G lines, ln(n) ~ 22.3. Moreover, if your output file B is much smaller than the input file A, the process will probably take only one or two passes.

Process:

  • Allocate a few megabytes for an input buffer I, a few gigabytes for a result buffer R, and a gigabyte or so for a hash table H. Open the input file F and the output file O.
  • Repeat: fill I from F and process it into R, via step 3.
  • For each line L in I, check whether L is already in H and R. If so, go on to the next L; otherwise add L to R and its hash to H.
  • When R is full, say with M lines, write it to O. Then repeatedly refill I from F, de-duplicate it as in step 3, and write the surviving lines to O. At EOF(F), go to step 5.
  • Repeat (using the old O as input F and a new O for output): read M lines from F and copy them to O. Then load R and H as in steps 2 and 3, and copy up to EOF(F) with de-duplication as before. Set M to the new number of de-duplicated lines at the beginning of each O file.

Note that after each pass, the first M lines of O contain no duplicates, and none of those M lines are duplicated in the rest of O. Thus at least 1/k'th of the source file is processed per pass, so processing takes no more than k passes.

Update 1: Instead of repeatedly writing out and reading back the already-processed leading lines, use a separate output file P, to which the process buffer R is appended at the end of each pass. This cuts the amount of reading and writing by a factor of k/2 when the result file B is nearly as large as A, and by somewhat less when B is much smaller than A; in no case does it increase the amount of I/O.
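
A rough Python sketch of the multi-pass process above (without the Update 1 optimization). The function names, the `capacity` value, the temporary-file naming, and the use of a single dict to stand in for both the result buffer R and the hash table H are illustrative assumptions, not part of the answer:

```python
import os

def one_pass(src, dst, prefix_len, capacity):
    """Copy the already-unique prefix, build an in-memory buffer of up to
    `capacity` new unique lines, and filter the rest of the file against it.
    Returns (new prefix length, number of lines still to be checked)."""
    seen = {}                                  # insertion-ordered; models both R and H
    buffer_flushed = False
    tail_lines = 0
    with open(src) as f, open(dst, "w") as out:
        for _ in range(prefix_len):            # step 5: copy the finished prefix
            out.write(f.readline())
        for line in f:
            if not line.endswith("\n"):
                line += "\n"                   # normalize a possibly newline-less last line
            if line in seen:                   # step 3: duplicate of a buffered line
                continue
            if not buffer_flushed:
                seen[line] = None              # step 3: add L to R and its hash to H
                if len(seen) >= capacity:      # step 4: R is full, write it to O
                    out.writelines(seen)
                    buffer_flushed = True
            else:
                out.write(line)                # step 4: checked against H only; these
                tail_lines += 1                # lines may still duplicate each other
        if not buffer_flushed:                 # EOF reached before R filled up
            out.writelines(seen)
    return prefix_len + len(seen), tail_lines

def dedup(in_path, out_path, capacity=1_000_000):
    src, prefix, pass_no = in_path, 0, 0
    while True:
        pass_no += 1
        dst = f"{out_path}.pass{pass_no}"
        prefix, remaining = one_pass(src, dst, prefix, capacity)
        if src != in_path:
            os.remove(src)                     # drop the previous pass's temp file
        src = dst
        if remaining == 0:                     # nothing left that could still hold dups
            break
    os.replace(src, out_path)

# dedup("A.txt", "B.txt")
```

Each call to one_pass extends the de-duplicated prefix by up to `capacity` lines, so a file whose unique lines fit in k buffers finishes in at most k passes, matching the O(kn) bound above.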

+3

One solution would be to sort the file and then copy it one line at a time into a new file, filtering out consecutive duplicates.
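
Once the file is sorted, the filtering step is a single streaming pass with one line of look-behind. A minimal sketch, assuming a line-oriented file and an already-sorted input; the names sorted_A.txt and unique_from_sorted are illustrative:

```python
def unique_from_sorted(sorted_path, out_path):
    """Copy a sorted file, skipping lines identical to the previous one."""
    previous = None
    with open(sorted_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line != previous:      # first occurrence of this line
                dst.write(line)
                previous = line

# unique_from_sorted("sorted_A.txt", "B.txt")
```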

The question then becomes: how do you sort a file that is too large to fit in memory?

Here's how Unix sort does it.

See also this question.
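
For reference, here is a compact sketch of an external merge sort in the same spirit as those links: sort memory-sized chunks, spill them to temporary files, then stream a k-way merge. The chunk size and file names are assumptions for illustration, not anything the answer prescribes:

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, lines_per_chunk=1_000_000):
    """Sort a file too large for memory by sorting chunks and merging them."""
    chunk_paths = []
    with open(in_path) as src:
        while True:
            chunk = [next(src, None) for _ in range(lines_per_chunk)]
            chunk = [l if l.endswith("\n") else l + "\n"
                     for l in chunk if l is not None]
            if not chunk:
                break
            chunk.sort()                              # in-memory sort of one chunk
            fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)                 # spill the sorted run to disk
            chunk_paths.append(path)
    runs = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))        # streaming k-way merge
    finally:
        for f in runs:
            f.close()
        for p in chunk_paths:
            os.remove(p)

# external_sort("A.txt", "sorted_A.txt")
# unique_from_sorted("sorted_A.txt", "B.txt")   # filtering step from the sketch above
```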

+6

You essentially need to build a result set of your search (if that language sounds like database terminology, it is no coincidence, however much you dislike the fact that databases deal with exactly the same questions you are asking).

A possible efficient data structure for this is either a sorted range (implemented as a tree of some kind) or a hash table. As you process the file, you insert each record into your result set, efficiently, and at that point you can check whether it already exists. When you are done, you will have a reduced set of unique records.

Rather than duplicating the actual record, your result set could also store a reference of some kind into the source records. Whether that is the more efficient solution depends on whether the records are large enough to make it worthwhile.
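
A rough sketch of that reference-storing variant, assuming a line-oriented file: the result set maps a hash of each record to the byte offsets of the distinct records seen so far, and a stored offset is used to re-read the record and rule out hash collisions. The file names and the use of Python's built-in hash() are illustrative assumptions:

```python
def dedup_by_reference(in_path, out_path):
    """Write each distinct line once, keeping only offsets (not copies) in memory."""
    seen = {}                                   # hash(record) -> offsets of distinct records
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        offset = src.tell()
        for line in iter(src.readline, b""):
            key = hash(line)
            duplicate = False
            for earlier in seen.get(key, []):
                here = src.tell()
                src.seek(earlier)
                same = src.readline() == line   # confirm it is not just a hash collision
                src.seek(here)
                if same:
                    duplicate = True
                    break
            if not duplicate:
                seen.setdefault(key, []).append(offset)
                dst.write(line)
            offset = src.tell()

# dedup_by_reference("A.txt", "B.txt")
```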

Or you could simply add a mark to the source data indicating whether or not a record should be included.

(Also consider using an efficient storage format such as NetCDF for binary data, since a textual representation is far slower to access and process.)

+2

Source: https://habr.com/ru/post/1379479/

