Given 1 TB of data on disk, made up of records of about 1 KB each, how can I find duplicates using 512 MB of RAM and unlimited disk space?

There is 1 TB of data on disk, consisting of records of roughly 1 KB each. How can duplicates be found using 512 MB of RAM and unlimited disk space?

+45
c++ algorithm data-structures
Apr 04 '10 at 5:21
8 answers

Use a Bloom filter : a single bit table shared by several hash functions. According to Wikipedia, the optimal number of hash functions is ln(2) * 2^32 / 2^30 ≈ 2.77 ≈ 3. (Hmm, plugging in 4 gives fewer false positives, but 3 is still better for this application.) This means you have a table of 512 megabytes, or 4 gigabits, and processing each record sets three new bits in that huge sea. If all three bits are already set, it is a potential match: write its three hash values to one file; otherwise, write them to another file. Note the record index along with each potential match.
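As a rough illustration (not the answer's own code), here is a minimal C++ sketch of the bit-setting step, assuming a 2^32-bit table and three 32-bit hash values per record derived from two independent hashes by double hashing; the hash functions themselves are left out:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // 2^32 bits = 512 MiB, matching the memory budget described above.
    class BloomFilter {
    public:
        BloomFilter() : bits_(1ULL << 32, false) {}

        // Sets the three bits for one record and reports whether all three were
        // already set, i.e. the record is a *potential* duplicate.
        bool test_and_set(uint32_t h1, uint32_t h2, uint32_t h3) {
            bool already = bits_[h1] && bits_[h2] && bits_[h3];
            bits_[h1] = bits_[h2] = bits_[h3] = true;
            return already;
        }

    private:
        std::vector<bool> bits_;   // packed to one bit per element by the library
    };

    int main() {
        BloomFilter filter;
        // In the real pass, h and g would be two independent 32-bit hashes of the
        // record's bytes; the third value is derived from them (double hashing).
        uint32_t h = 0x12345678u, g = 0x9abcdef0u;
        bool candidate = filter.test_and_set(h, g, h + 2u * g);
        std::printf("potential duplicate: %d\n", candidate);
    }

Because std::vector<bool> packs one bit per element, the table occupies exactly the 512 MB budget; everything else (hash choice, the two output files) is as described above.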

(If a 5% false-positive rate is acceptable, omit the large file and use the small file as the result.)

When you are done, you should have a file of roughly 49M candidate matches and a file of roughly 975M negatives, which may yet match the positives. Read the former into a vector<pair<vector<uint32_t>,vector<uint32_t> > > (the indexes go in the latter vector ; the former could be an array ) and sort it. Put the indexes into another vector<uint32_t> ; they are already sorted. Read the large file again, but instead of setting bits in the table, look each triple of hash values up in the vector (for example, using equal_range ). Use the positive file's index list to keep track of the current record index in the negative file. If no match is found, ignore the record; otherwise, append the index with match->second.push_back(current_negative_record_index) .
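A hedged fragment of that lookup step, with names of my own choosing (Triple, MatchMap, note_negative are illustrative, not from the answer):

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <utility>
    #include <vector>

    using Triple   = std::array<uint32_t, 3>;   // the three hash values of one candidate
    using MatchMap = std::vector<std::pair<Triple, std::vector<uint32_t>>>;

    // Heterogeneous "less" so equal_range can compare map entries against a bare Triple.
    struct TripleLess {
        static const Triple& key(const Triple& t) { return t; }
        static const Triple& key(const MatchMap::value_type& p) { return p.first; }
        template <class A, class B>
        bool operator()(const A& a, const B& b) const { return key(a) < key(b); }
    };

    // 'positives' holds the small file's triples (sorted with TripleLess) plus their
    // record indexes; 't' is a triple re-read from the large (negative) file.
    void note_negative(MatchMap& positives, const Triple& t, uint32_t negative_record_index) {
        auto range = std::equal_range(positives.begin(), positives.end(), t, TripleLess{});
        for (auto match = range.first; match != range.second; ++match)
            match->second.push_back(negative_record_index);   // the answer's push_back step
    }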

Finally, iterate over the map and the record-index vectors. Any bucket with more than one entry is almost certain to contain a set of duplicates, but you have come this far, so look them up and compare them in full to be sure.

Total sequential disk I/O: (one pass over the data = 1 TiB) + (96 hash bits per record = 12 GiB) + (32 index bits per positive = ~200 MiB).

Final edit (really): On reflection, the Bloom filter aspect may not really be helping here. The amount of hash data is more of a limiting factor than the number of false positives. With just one hash function, the total amount of hash data would be 4 gigabytes, and the indexes of the 124 million expected false positives would add ~500 MB. That should globally optimize this strategy.

Clarification: there is a difference between a Bloom filter false positive and a hash collision. A hash collision cannot be resolved except by going back to the original records and comparing them, which is expensive. A Bloom false positive can be resolved by going back to the original hash values and comparing them, which is what the second pass of this algorithm does. So, on second thought, the one-hash filter described in the "final" edit would unduly hammer the disk. A two-hash Bloom filter would increase the number of false positives that end up in a single bucket of the match map, but it would bring the total number of false positives back down to the tens of millions.

+18
Apr 04 '10

The solutions offered so far seem too complicated. A Bloom filter , while being the data structure du jour of the last several years, isn't best applied in a situation like this: because no data can be associated with the hashed content, you must not only maintain the Bloom filter, but you must still record each (only 6-bit!) hash value and write it to disk, destroying the benefit of the Bloom filter and giving a preposterously high collision rate.

On the other hand, merge sorting the entire terabyte will take not only O(n log n) comparisons but O(n log n) disk traffic, since the majority of the intermediate files would have to be merged from disk rather than from memory. Any real solution should try to minimize disk traffic as much as possible, since that is our primary bottleneck.

My solution is simple, and it makes one assumption: that the terabyte of data is stored in what is effectively one file.

Iterate through the records of the terabyte file and hash each one. A cryptographic hash is unnecessary, costly and too large here; instead, use something like the 64-bit version of murmurhash . It can hash more than 2 GiB/sec (far faster than we are likely to need, given the speed of storage these days) and has excellent (though not cryptographically secure) collision resistance. With a 64-bit hash, we would expect our first collision around 2^32 records, so our roughly one billion records will probably have no hash collisions at all.

Write the hashes and their associated record offsets out to another file. Since the records contain arbitrary binary data, we cannot rely on Unix sort(1) to sort them, because some of the hashes and offsets may contain bytes that sort(1) would interpret as newlines. We simply write the entries out as fixed-width records (probably 16 bytes: 8 bytes for the 64-bit murmur2 hash and 8 bytes for the offset into the terabyte file). The resulting file should be about 16 GB, given our number of records.
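A minimal sketch of this pass, under a few assumptions of my own: the records are a fixed 1 KB, the input is named data.bin, and FNV-1a stands in for the suggested 64-bit murmurhash:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Stand-in 64-bit hash (FNV-1a); the answer suggests MurmurHash64 instead.
    uint64_t hash64(const unsigned char* p, size_t n) {
        uint64_t h = 14695981039346656037ULL;
        while (n--) { h ^= *p++; h *= 1099511628211ULL; }
        return h;
    }

    int main() {
        const size_t kRecordSize = 1024;                  // "about 1 KB" per record (assumed fixed)
        std::vector<unsigned char> record(kRecordSize);

        FILE* in  = std::fopen("data.bin", "rb");         // the terabyte file (name assumed)
        FILE* out = std::fopen("hashes.bin", "wb");       // 16-byte fixed-width entries
        if (!in || !out) return 1;

        uint64_t offset = 0;
        while (std::fread(record.data(), 1, kRecordSize, in) == kRecordSize) {
            uint64_t h = hash64(record.data(), kRecordSize);
            std::fwrite(&h, sizeof h, 1, out);            // 8 bytes: hash
            std::fwrite(&offset, sizeof offset, 1, out);  // 8 bytes: offset into the big file
            offset += kRecordSize;
        }
        std::fclose(in);
        std::fclose(out);
    }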

We can sort this file by reading in as many entries as will safely fit in memory, sorting them, and flushing the sorted chunks back to disk. We could fit more entries in memory with a heapsort (it uses O(1) space) than with a quicksort (which uses O(log n) memory for the call stack), but in most implementations quicksort wins by virtue of its memory locality and lower instruction count. These intermediate files (there should be 35-40 of them) are written out to disk.
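One such chunk-sorting pass might look like this sketch (the chunk size is my choice, picked so each sorted run is roughly 450 MB and yields the 35-40 files mentioned; file names are illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Entry { uint64_t hash; uint64_t offset; };     // the 16-byte fixed-width entry

    int main() {
        const size_t kChunkEntries = 28u * 1024 * 1024;   // ~448 MiB per chunk, under 512 MB
        std::vector<Entry> chunk;

        FILE* in = std::fopen("hashes.bin", "rb");
        if (!in) return 1;

        for (int chunk_no = 0; ; ++chunk_no) {
            chunk.resize(kChunkEntries);
            size_t n = std::fread(chunk.data(), sizeof(Entry), kChunkEntries, in);
            if (n == 0) break;
            chunk.resize(n);                              // keep only what was actually read

            std::sort(chunk.begin(), chunk.end(),
                      [](const Entry& a, const Entry& b) { return a.hash < b.hash; });

            std::string name = "chunk_" + std::to_string(chunk_no) + ".bin";
            FILE* out = std::fopen(name.c_str(), "wb");
            std::fwrite(chunk.data(), sizeof(Entry), chunk.size(), out);
            std::fclose(out);
        }
        std::fclose(in);
    }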

The last step is to merge these files (in memory; there is no need to store the result on disk for this), collecting all hash collisions, looking up the associated records in the terabyte file, comparing the records for duplication, and emitting the records (or their offsets) in whatever way the problem specifies.

As far as I can tell, this approach hits the disk far less than any other proposed solution, and it is conceptually very simple: hash the records, look for duplicates among the hashes, and verify against the actual records.

For disk I/O, it reads the terabyte data file, writes 16 GB to disk, reads those 16 GB back and writes them out sorted, then reads the sorted data and reports the duplicates. As an optimization, the hashing pass could accumulate the entries in memory and sort them before flushing them to disk: that cuts out the unsorted 16 GB intermediate file and lets the process move from hashing straight to merging and reporting duplicates.

+19
Apr 05 '10

That is a lot of records ;-) on the order of 1,000,000,000. Better think this through carefully...

The nature of the records is not specified: do we just discover them one at a time, reading sequentially, or is there some kind of index, or are they perhaps stored as files in various directories? Also unspecified in the question is the availability of a DBMS that we could use for the index data (rather than sorting it with our own code). Finally, an [even rough] idea of the number of duplicates would help direct some of these choices towards an efficient process.

If no index exists, we can/should create one; this can be done as a first pass over the data. The same pass would be used to produce a message digest (hash) of each record (or, possibly, for performance reasons, of the first few hundred bytes of each record).

The general idea is to quickly produce an index that can be used to identify possible duplicates, and to finalize the list of actual duplicates, possibly via parallel processing .

Information useful in the index would be (a possible layout is sketched after this list):

  • record length
  • first few bytes of the record
  • hash code (more on this below)
  • also the offset in the file, or whatever other pointer to the data, though of course, unlike the three elements above, this cannot be used to identify potential matches.
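A possible fixed-width layout for such an index entry (field sizes are my own guesses, purely for illustration):

    #include <cstdint>

    // One index entry per record; sizes chosen for illustration only.
    struct IndexEntry {
        uint32_t record_length;   // record length
        uint8_t  prefix[16];      // first few bytes of the record
        uint64_t hash;            // hash code (see below)
        uint64_t offset;          // offset of the record in the 1 TB store
    };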

The choice of hash is critical: a fast algorithm should be favored, even at the cost of a somewhat less perfect distribution; the number of bytes hashed per record is also a trade-off: maybe 100 to 200 bytes (i.e. roughly 10 to 20% of the average record size) is a good value, depending on the expected duplication rate and on the time savings this provides (compared with hashing the whole record). (See below.)

Once such an index is available, we can (relatively quickly/effortlessly) get a count of the possible duplicates; based on this result, a second pass aimed at improving the quality of the index can be made if it is not deemed selective enough (leaving out the records that are readily deemed unique). This second pass can compute another hash, over the whole record (excluding the first x bytes covered by the first hash), or over yet another subset of the record. Note that thanks to the index, this second pass can be multithreaded if possible.

The second, or final, pass then requires sorting the records within each group of possible matches (same length, same hash code(s), same first bytes). This can be done as described by Pax Diablo; the advantage of the index is that such an operation can again be multithreaded and involves much smaller sets (many of them). Added : here again Nick Johnson makes the good point that a second pass might not even be necessary if we used a long hash code (he suggests a 128-byte-long SHA1). Assuming there is no benefit in partially hashing the records, this is a very plausible solution, since the index could live on disk and yet be sorted and stored faster than if we were sorting/storing all the records.




Edit: Nick Johnson makes the great point that the latency of seeks on disk storage may be such that a plain sequential read is faster, and that since the bottleneck is disk I/O, a fast hash function running concurrently may effectively be no slower than the sequential read and hence add nothing to the overall process. This is a likely possibility (particularly if a sequential read is effectively required anyway to detect each record start/end, etc.), which is why I "hedged my bet" by writing "depending on the time savings this provides...". That said, the actual layout of the records on disk is one of the open parameters of the question (for example, if we are just reading individual files in directories, thereby imposing non-sequential reads), and terabyte-sized storage is also likely backed by a fancy RAID, where seek latency, while remaining a concern, tends to be much improved.
I agree that a well-structured two-pass approach may be more efficient than one in which every record is fully hashed, but I wanted to stress the possibility and the advantages of the one-pass approach. As with many interview questions, several characteristics of the situation were left unspecified; the idea is not so much to see the candidate produce the one absolutely correct answer (although some answers can be quite wrong!) but rather to gain insight into their thought process and their ability to identify the options and the decision points.

+13
Apr 04 '10

Find a suitable hash function and hash each record, storing the list of hashes together with record indexes in a file. Now sort that file by hash. Finally, check the records behind matching hash entries for real duplicates.

Of course, this depends on the number of duplicates that you expect to find, and what you are going to do with the information later.

+6
Apr 04 '10

Load the data into memory 512M at a time, sort that chunk, and write it out to disk (as its own file). Once the entire 1T has been processed this way, merge-sort the individual files into one big honkin' file, then read that big (sorted) file sequentially, writing it to the final file while removing the duplicate records.

1T, 512M at a time, gives roughly 2,048 chunk files (assuming binary definitions of the SI units rather than decimal). 512M of 1K records only allows 524,288 records in memory at a time, so you may have to do the merge sort in two stages. In other words, merge-sort the ~2,048 files in four groups to create four bigger files, then merge-sort those four into the one big sorted file. That is the file you then process sequentially to remove duplicates.

Merge-sorting multiple already-sorted files simply means repeatedly looking at the first remaining record in each file and picking the "lowest" one. For example, with two files a and b :

     a   b
     7   6
     3   5
     1   4
         2
      \_/

     1 (a)
     2 (b)
     3 (a)
     4 (b)
     5 (b)
     6 (b)
     7 (a)
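A small C++ sketch of that merge for two files of fixed-size 1 KB records, compared bytewise (file names are illustrative; generalizing to N files just means picking the smallest remaining head each round):

    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        const size_t kRec = 1024;                       // 1 KB records, sorted bytewise within each file
        std::vector<unsigned char> ra(kRec), rb(kRec);

        FILE* a   = std::fopen("sorted_a.bin", "rb");
        FILE* b   = std::fopen("sorted_b.bin", "rb");
        FILE* out = std::fopen("merged.bin", "wb");
        if (!a || !b || !out) return 1;

        bool have_a = std::fread(ra.data(), 1, kRec, a) == kRec;
        bool have_b = std::fread(rb.data(), 1, kRec, b) == kRec;

        while (have_a || have_b) {
            // Pick the "lowest" remaining record, exactly as in the example above.
            bool take_a = have_a && (!have_b || std::memcmp(ra.data(), rb.data(), kRec) <= 0);
            if (take_a) {
                std::fwrite(ra.data(), 1, kRec, out);
                have_a = std::fread(ra.data(), 1, kRec, a) == kRec;
            } else {
                std::fwrite(rb.data(), 1, kRec, out);
                have_b = std::fread(rb.data(), 1, kRec, b) == kRec;
            }
        }
        std::fclose(a); std::fclose(b); std::fclose(out);
    }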
+3
Apr 04 '10 at 5:27

Create a hash of each record; keep the record number and hash in memory, spilling to a file when necessary (sorting the data into hash order before writing it out). As you compute each new hash, check whether it already exists in memory - that gives you early detection. (This may or may not be a major benefit.)

When you have read all the data, you will have a number of hash files, each with the associated record numbers and each already sorted. Merge these files together, spotting duplicates as you go. You don't even need to do more than record the duplicates for this application, so you can discard hashes as soon as they are shown to be unique.

Given the sizes - 0.5 GB of memory, 1000 GB of data, 1 KB per record, so about 1 billion records - and assuming a 256-bit hash (though 128-bit may well be adequate), we would use 32 bytes for the hash plus 4 bytes for the record number; with about 1 billion records, we need about 36 GB for the sort files, generated as 500 MB files (matching the available memory), so there would be 70-80 files to merge at the end, which seems fairly manageable. The list gives you the record numbers - you then need to access the 1 TB file to read the records themselves. You also need to think about what you are going to do with the duplicates: do you need information about the original record and the extras, and does it matter which of the duplicates you keep and which you reject? Etc.
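A quick back-of-the-envelope check of those figures, assuming a 36-byte sort entry (32-byte hash + 4-byte record number):

    #include <cstdint>
    #include <cstdio>

    struct SortEntry {
        uint8_t  hash[32];      // 256-bit hash
        uint32_t record_no;     // 4-byte record number
    };
    static_assert(sizeof(SortEntry) == 36, "expecting no padding here");

    int main() {
        const uint64_t records     = 1000ULL * 1000 * 1000;        // ~1 billion records
        const uint64_t total_bytes = records * sizeof(SortEntry);  // ~36 GB of sort data
        const uint64_t spill_bytes = 500ULL * 1000 * 1000;         // ~500 MB per spill file
        const uint64_t spill_files = (total_bytes + spill_bytes - 1) / spill_bytes;  // ~72 files
        std::printf("%llu bytes of sort data in about %llu spill files\n",
                    (unsigned long long)total_bytes, (unsigned long long)spill_files);
    }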

+1
Apr 04 '10

First, set up a computer with an infinitely large swap file on an infinitely large disk ...

+1
Apr 04 '10 at 17:14

You can use hashing to reduce the problem size. For example, for the 1 TB of data, define a hash function and split the data into, say, ten files according to the hash value (each of which is smaller than the original, and every set of identical records lands in the same file). If any file is still too large, repeat the procedure on that file until it fits in memory. Finally, count the occurrences within each file however you like.
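A hedged sketch of one partitioning pass (ten buckets, fixed 1 KB records, FNV-1a as a stand-in hash, file names illustrative); because identical records hash identically, all copies of a duplicate end up in the same bucket:

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Stand-in hash; any well-distributed hash over the record bytes will do.
    uint64_t hash64(const unsigned char* p, size_t n) {
        uint64_t h = 14695981039346656037ULL;            // FNV-1a
        while (n--) { h ^= *p++; h *= 1099511628211ULL; }
        return h;
    }

    int main() {
        const size_t kRec = 1024;
        std::vector<unsigned char> rec(kRec);

        FILE* in = std::fopen("data.bin", "rb");          // the 1 TB input (name assumed)
        if (!in) return 1;
        FILE* bucket[10];
        for (int i = 0; i < 10; ++i)
            bucket[i] = std::fopen(("bucket_" + std::to_string(i) + ".bin").c_str(), "wb");

        while (std::fread(rec.data(), 1, kRec, in) == kRec) {
            int b = static_cast<int>(hash64(rec.data(), kRec) % 10);
            std::fwrite(rec.data(), 1, kRec, bucket[b]);  // duplicates always share a bucket
        }
        for (int i = 0; i < 10; ++i) std::fclose(bucket[i]);
        std::fclose(in);
    }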

+1
Sep 18


