Efficiently delete all duplicate records

I have a file that can be 30 GB or more. Each row in this file is called a record and consists of two columns, like this:

id1 id2

Both identifiers are 32-bit integers. My task is to write a program that removes all duplicate records, keeps only the unique records, and finally outputs the unique id2 values to a file.

There are a couple of restrictions: at most 30 GB of memory may be used, and it is better to get the job done with a non-multithreaded/multiprocess program.

Initially I came up with this idea: because of the memory limit, I decided to read the file n times, each pass keeping in memory only the records with id1 % n = i (i = 0, 1, 2, ..., n-1). I use the data structure std::map<int, std::set<int> > : it takes id1 as the key and puts id2 into the std::set belonging to that id1.

This way the memory limit is not violated, but it is rather slow. I think that is because insertion slows down as the std::map and std::set grow. Moreover, I have to read the file n times, and when each round is completed I have to clear the std::map for the next round, which also costs some time.
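Roughly, a sketch of this n-pass scheme looks like the following (file names, the value of n, and the assumption of non-negative, whitespace-separated ids are only illustrative):

    #include <cstdio>
    #include <map>
    #include <set>

    int main() {
        const int n = 10;                            // number of passes; tune so one partition fits in RAM
        for (int i = 0; i < n; ++i) {
            std::map<int, std::set<int> > part;      // id1 -> all distinct id2 values seen for it
            std::FILE* in = std::fopen("input.txt", "r");
            int id1, id2;
            while (std::fscanf(in, "%d %d", &id1, &id2) == 2) {
                if (id1 % n == i)                    // keep only the records that belong to this pass
                    part[id1].insert(id2);           // the set silently drops duplicate records
            }
            std::fclose(in);
            // ... write out the unique id2 values of this partition, then the map is discarded
        }
        return 0;
    }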

I also tried a hash table, but it doesn't satisfy me either; I thought there might be too many collisions even with 300W (roughly 3,000,000) buckets.

So I am posting my problem here; could you guys suggest a better data structure or algorithm?

Thank you very much.

PS

Scripts (shell, Python) are welcome if they can do this efficiently.

+4
4 answers

Unless I have missed a requirement, it should be possible to do this in the Linux shell with

 sort -u inputfile > outputfile 

Many implementations also let you run sort in parallel:

 sort --parallel=4 -u inputfile > outputfile 

for four-way parallelism.

Note that sort can temporarily use a lot of space in /tmp . If you run out of free disk space there, you can use the -T option to point it at an alternative location with enough space to use as its temporary directory.


(Edit:) A few notes on efficiency:

  • A significant part of the running time of any solution to your problem will be spent on I/O, which sort is heavily optimized for.
  • Unless you have an awful lot of RAM, your solution is likely to end up doing part of the work on disk (just like sort does). Again, optimizing that means a lot of work, whereas for sort all of that work has already been done.
  • One disadvantage of sort is that it operates on string representations of the input lines. If you were to write your own code, one thing you could do (similar to what you already suggested) is to convert the input lines to 64-bit integers and hash them; a rough sketch of this idea follows after this list. If you have enough RAM, that might be a way to beat sort in terms of speed, provided you make the I/O and the integer conversions really fast. I suspect it is not worth the effort, since sort is easy to use and, I think, fast enough.
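For illustration, a hedged sketch of that last point might look like this (the file names and the reserve size are placeholders of mine, and it only works if the distinct records fit within the 30 GB budget):

    #include <cstdint>
    #include <cstdio>
    #include <unordered_set>

    int main() {
        std::unordered_set<std::uint64_t> seen;      // each (id1, id2) record packed into one 64-bit key
        seen.reserve(100000000);                     // placeholder: pre-size to the expected number of unique records
        std::FILE* in  = std::fopen("inputfile", "r");
        std::FILE* out = std::fopen("outputfile", "w");
        int id1, id2;
        while (std::fscanf(in, "%d %d", &id1, &id2) == 2) {
            std::uint64_t key = (static_cast<std::uint64_t>(static_cast<std::uint32_t>(id1)) << 32)
                              | static_cast<std::uint32_t>(id2);
            if (seen.insert(key).second)             // true only for the first occurrence of a record
                std::fprintf(out, "%d\n", id2);      // emit id2 once per unique record
        }
        std::fclose(in);
        std::fclose(out);
        return 0;
    }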
+8

I just don't think you can do this efficiently without using the disk heavily. Any data structure will have so much memory and/or storage overhead that your algorithm will suffer, so I would expect a sorting solution to work best.

I reckon you can sort large chunks of the file at a time and then merge those chunks afterwards (as in merge sort). After sorting a chunk it obviously has to go back to disk; you can either overwrite the data in the input file (assuming it is binary) or write to a temporary file.

As far as the records go, you just have a bunch of 64-bit values. With 30 GB of RAM you can hold almost 4 billion records at a time, which is very sweet. You could sort that many in place with quicksort, or half as many with mergesort. However, you probably won't get a contiguous block of memory of that size, so you will have to break it up. That makes quicksort a little trickier, so you may want to use mergesort in RAM as well.
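A hedged sketch of such an in-RAM chunk sort, assuming binary (id1, id2) records and ignoring the contiguity concern for brevity (names and sizes are placeholders):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Sort one chunk of binary records in RAM and write it to a temporary file.
    // Each record is read as a single 64-bit value. Returns false when the input is exhausted.
    bool sort_one_chunk(std::FILE* in, const char* tmp_name, std::size_t max_records) {
        std::vector<std::uint64_t> buf(max_records);
        std::size_t got = std::fread(buf.data(), sizeof(std::uint64_t), max_records, in);
        if (got == 0) return false;
        buf.resize(got);
        std::sort(buf.begin(), buf.end());           // in-place sort of this chunk
        std::FILE* out = std::fopen(tmp_name, "wb");
        std::fwrite(buf.data(), sizeof(std::uint64_t), buf.size(), out);
        std::fclose(out);
        return true;
    }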

During the final merge it is trivial to discard duplicates. The merge can be entirely file-based, but at worst you will use an amount of disk equivalent to twice the number of records in the input file (one file for scratch and one file for output). If you can use the input file as scratch space, then you have exceeded neither your RAM limit nor your disk limits (if any).
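And a matching sketch of a two-way merge that drops duplicates as it writes (again just an illustration of the idea, not the answer's code):

    #include <cstdint>
    #include <cstdio>

    // Merge two sorted files of 64-bit packed records into one, skipping duplicates.
    void merge_unique(std::FILE* a, std::FILE* b, std::FILE* out) {
        std::uint64_t va = 0, vb = 0, last = 0;
        bool have_a = std::fread(&va, sizeof va, 1, a) == 1;
        bool have_b = std::fread(&vb, sizeof vb, 1, b) == 1;
        bool first = true;
        while (have_a || have_b) {
            std::uint64_t v;
            if (have_a && (!have_b || va <= vb)) {   // take the smaller head element
                v = va;
                have_a = std::fread(&va, sizeof va, 1, a) == 1;
            } else {
                v = vb;
                have_b = std::fread(&vb, sizeof vb, 1, b) == 1;
            }
            if (first || v != last) {                // write each distinct record exactly once
                std::fwrite(&v, sizeof v, 1, out);
                last = v;
                first = false;
            }
        }
    }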

I think the key here is the requirement that it not be multithreaded. That lends itself well to disk-based storage: the bulk of your time will be spent on disk access, so you want to make that as efficient as possible. In particular, when you do your bulk sorting you want to minimize the number of seeks. You have a huge amount of memory to use as a buffer, so I'm sure you can make it very efficient.

So, let's say your file is 60 GB (and I assume it is binary), which means around 8 billion records. If you bulk-sort it in RAM, you can process 15 GB at a time; that amounts to reading and (over)writing the file once, leaving four chunks. If you want to do a pure merge sort, then you always deal with just two arrays, which means you read and write the file two more times: once to merge each pair of 15 GB chunks into 30 GB, and one final merge of those two (which includes discarding the duplicates).

I don't think that is too bad: three times in and out. If you figure out a nice way to do the quicksort, then you can probably manage with even fewer passes through the file. I believe a data structure such as a deque would work well, since it can handle non-contiguous chunks of memory... but you would probably want to build your own and fine-tune your sorting algorithm to exploit it.

+1

Instead of std::map<int, std::set<int> > , use std::unordered_multimap<int, int> . If you cannot use C++11, write your own.

std::map is node-based and calls malloc on every insertion, which is probably why it is slow. With an unordered map (a hash table) you can pre-allocate if you know the number of records. Even if you don't, the number of mallocs will be O(log N) instead of O(N) as with std::map .
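A sketch of what that substitution could look like; the equal_range check for duplicate records and the reserve size are my additions, not part of the answer:

    #include <unordered_map>

    // Insert a record into a pre-sized hash-based multimap, skipping exact duplicates.
    void insert_record(std::unordered_multimap<int, int>& m, int id1, int id2) {
        auto range = m.equal_range(id1);             // all id2 values already stored for this id1
        for (auto it = range.first; it != range.second; ++it)
            if (it->second == id2) return;           // duplicate (id1, id2) record, skip it
        m.emplace(id1, id2);
    }

    int main() {
        std::unordered_multimap<int, int> m;
        m.reserve(100000000);                        // placeholder: pre-allocate if the record count is known
        insert_record(m, 1, 2);
        insert_record(m, 1, 2);                      // ignored as a duplicate
        return 0;
    }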

I would bet this will be a few times faster and more memory-efficient than using an external sort -u .

+1

This approach may help when there are not many duplicate records in the file.

1st pass. Allocate most of the memory to a Bloom filter. Hash every pair from the input file and put the result into the Bloom filter. Write each duplicate reported by the Bloom filter to a temporary file (this file will also contain some false positives, which are not actually duplicates).
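A minimal Bloom filter sketch for that first pass (the hash mixing, the two seeds, and the sizing are my own illustrative choices, not prescribed by the answer):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A very small Bloom filter over 64-bit keys (one packed (id1, id2) record per key).
    class BloomFilter {
    public:
        explicit BloomFilter(std::size_t bits) : bits_(bits), data_((bits + 63) / 64, 0) {}

        // Sets the key's bits and returns true if the key was possibly seen before,
        // i.e. the record is a duplicate candidate (false positives are possible).
        bool add(std::uint64_t key) {
            bool maybe_seen = true;
            const std::uint64_t seeds[2] = {0x9E3779B97F4A7C15ull, 0xC2B2AE3D27D4EB4Full};
            for (std::uint64_t seed : seeds) {
                std::uint64_t h = mix(key ^ seed) % bits_;
                std::uint64_t mask = 1ull << (h & 63);
                if (!(data_[h >> 6] & mask)) maybe_seen = false;
                data_[h >> 6] |= mask;
            }
            return maybe_seen;
        }

    private:
        static std::uint64_t mix(std::uint64_t x) {  // splitmix64-style mixing function
            x += 0x9E3779B97F4A7C15ull;
            x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
            x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
            return x ^ (x >> 31);
        }
        std::size_t bits_;
        std::vector<std::uint64_t> data_;
    };

During the 1st pass each record would be packed into a 64-bit key, fed to add(), and appended to the temporary file of duplicate candidates whenever add() returns true.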

2nd pass. Load the temporary file and build a map of its records. The key is std::pair<int,int> , the value is a boolean flag. This map may be implemented as std::unordered_map / boost::unordered_map, or as std::map .

3rd pass. Read the input file again, look up each record in the map, and output its id2 if the record is either not found or its flag is not yet set; then set the flag.
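Passes 2 and 3 could then look roughly like this (the file names, the pair hash, and all identifiers are placeholders of mine):

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <utility>

    // std::unordered_map has no default hash for std::pair, so supply a trivial one.
    struct PairHash {
        std::size_t operator()(const std::pair<int, int>& p) const {
            return static_cast<std::size_t>(
                (static_cast<std::uint64_t>(static_cast<std::uint32_t>(p.first)) << 32)
                | static_cast<std::uint32_t>(p.second));
        }
    };

    int main() {
        // 2nd pass: load the duplicate candidates; the bool flag means "id2 already printed".
        std::unordered_map<std::pair<int, int>, bool, PairHash> candidates;
        std::FILE* tmp = std::fopen("duplicates.tmp", "r");
        int id1, id2;
        while (std::fscanf(tmp, "%d %d", &id1, &id2) == 2)
            candidates[std::make_pair(id1, id2)] = false;
        std::fclose(tmp);

        // 3rd pass: print id2 for every record that is not a candidate, and for the
        // first occurrence of each candidate; later occurrences are skipped.
        std::FILE* in = std::fopen("input.txt", "r");
        while (std::fscanf(in, "%d %d", &id1, &id2) == 2) {
            auto it = candidates.find(std::make_pair(id1, id2));
            if (it == candidates.end() || !it->second) {
                std::printf("%d\n", id2);
                if (it != candidates.end()) it->second = true;
            }
        }
        std::fclose(in);
        return 0;
    }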

+1

Source: https://habr.com/ru/post/1434523/

