That is a lot of entries ;-) on the order of 1,000,000,000. Better think this through...
The nature of the records is not specified: do we simply open them and read them one at a time, sequentially, or is there some kind of index, or are they perhaps stored as files in various directories? Also left open by the question is the availability of a DBMS, which we could use for the index data (rather than sorting it with our own code). Finally, even a rough idea of the number of duplicates would help direct some of these choices toward an efficient process.
If no index exists, we can create one; this can be done in a first pass over the data. The same pass will be used to produce a message digest (hash) of each record (or, possibly for performance reasons, of only the first few hundred bytes of each record).
The general idea is to quickly produce an index that can be used to identify possible duplicates, and to finalize the list of actual duplicates, possibly through parallel processing.
Useful information to keep in the index will be (a sketch of such an entry follows the list):
- the record length
- the first few bytes of text
- a hash code (more on this below)
- and also the offset in the file, or any other pointer to the data; unlike the three items above, this of course cannot be used to identify potential matches.
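As a concrete illustration, here is a minimal sketch in Python of what one such index entry and its construction could look like; the field names, the 8-byte prefix and the 150-byte hash span are assumptions chosen for the example, not prescriptions.

```python
import hashlib
from collections import namedtuple

# Hypothetical index entry; field names are assumptions for this sketch.
IndexEntry = namedtuple("IndexEntry", ["length", "prefix", "digest", "offset"])

PREFIX_LEN = 8    # first few bytes of the record, kept verbatim
HASH_SPAN = 150   # hash only the first ~150 bytes (see the discussion below)

def make_entry(record: bytes, offset: int) -> IndexEntry:
    """Build one index entry for a record found at `offset` in the file."""
    digest = hashlib.md5(record[:HASH_SPAN]).digest()  # any fast, well-distributed hash
    return IndexEntry(len(record), record[:PREFIX_LEN], digest, offset)
```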
The choice of the hash is critical: it should favor a fast algorithm while still being reasonably well distributed; the number of bytes hashed per record is also a compromise: somewhere between 100 and 200 bytes (i.e., roughly 10 to 20% of the average record size) is probably a good value, depending on the expected duplication rate and on the time savings this provides compared with hashing the whole record (see below).
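To get a feel for the time savings mentioned above, a rough (and admittedly naive) timing of partial versus full hashing could look like this; the ~1 KB record size and MD5 are assumptions for illustration only.

```python
import os, time, hashlib

def time_hashing(records, span=None):
    """Hash every record, optionally only its first `span` bytes,
    and return the elapsed time. Purely illustrative."""
    t0 = time.perf_counter()
    for r in records:
        hashlib.md5(r if span is None else r[:span]).digest()
    return time.perf_counter() - t0

sample = [os.urandom(1000) for _ in range(10_000)]  # assumed ~1 KB average records
print("first 150 bytes:", time_hashing(sample, span=150))
print("whole record:   ", time_hashing(sample))
```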
Once such an index is available, we can (relatively quickly and cheaply) obtain a count of the possible duplicates; based on this result, a second pass can be made to improve the quality of the index if it is not deemed sufficiently selective, excluding the records that are already known to be unique. This second pass could compute another hash, over the whole record (excluding the first x bytes covered by the first hash) or over another subset of the record. Note that thanks to the index, this second pass can be multithreaded if needed.
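A sketch of how the index supports this counting step: grouping entries by their identifying fields makes singleton groups (provably unique records) trivial to exclude before any further pass. This assumes entries shaped like the IndexEntry sketch above.

```python
from collections import defaultdict

def candidate_groups(entries):
    """Group entries (shaped like the IndexEntry above) by their
    identifying fields; only groups with more than one member can
    contain duplicates, everything else is provably unique."""
    groups = defaultdict(list)
    for e in entries:
        groups[(e.length, e.prefix, e.digest)].append(e)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}
```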
The second or final pass requires sorting the records within each group of possible matches (same length, same hash code(s), same first bytes). This can be achieved as described by Pax Diablo; the advantage of the index is that this operation can again be multithreaded and involves much smaller sets (many of them). Added: here again Nick Johnson makes the point that the second pass may be unnecessary if we use a long hash code (he suggests SHA-1, whose digest is 160 bits). Assuming there is no benefit to hashing records only partially, this is a very plausible solution, since the index can reside on disk and yet be sorted and stored faster than if we were sorting/storing the complete records.
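Within one candidate group, the final confirmation might look like the sketch below; `read_record` is a hypothetical helper (the question leaves the record layout open), and SHA-1 stands in for the long hash mentioned above.

```python
import hashlib

def confirm_duplicates(path, group):
    """Confirm duplicates within one candidate group by hashing the
    complete records. `read_record(path, offset)` is a hypothetical
    helper, since the question leaves the record layout open."""
    seen = {}        # full-record digest -> offset of first occurrence
    duplicates = []
    for entry in sorted(group, key=lambda e: e.offset):
        body = read_record(path, entry.offset)
        full_digest = hashlib.sha1(body).digest()  # long hash: collisions negligible
        if full_digest in seen:
            duplicates.append((seen[full_digest], entry.offset))
        else:
            seen[full_digest] = entry.offset
    return duplicates
```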
Edit: Nick Johnson rightly notes that seek latency on disk may be such that a plain sequential read is faster, and that since the bottleneck is then disk I/O, a fast hash function running concurrently with the reads may effectively be free and add nothing to the overall cost. This is a distinct possibility (particularly if a sequential read is effectively required anyway, for example to detect each record's beginning and end), and it is why I hedged my bet by writing "depending on the time savings this provides...". It also underlines that the actual layout of the records on disk is one of the open parameters of the question (for example, if we are merely reading individual files in directories, hence imposing non-sequential reads), as is the availability of terabyte-sized storage backed by fancy RAID, where seek latency, while remaining a concern, tends to improve considerably.
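For completeness, here is one way the "hash while reading" overlap could be sketched; `records` is assumed to be a generator performing the sequential reads, and the caveats are noted in the comments.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def overlapped_hashing(records, workers=2, window=64):
    """Hash records while the sequential read that produces them keeps
    the disk busy. `records` is assumed to be a generator reading the
    file in order; a bounded window of in-flight tasks avoids buffering
    everything in memory. Caveat: CPython's hashlib releases the GIL
    only for large buffers, so very small records may gain little here.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = []
        for record in records:
            pending.append(pool.submit(lambda r: hashlib.md5(r).digest(), record))
            if len(pending) >= window:
                yield pending.pop(0).result()
        for fut in pending:
            yield fut.result()
```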
I agree that the two-pass approach may well be more efficient than one in which every record is hashed completely, but I wanted to stress the possibility, and the advantages, of the one-pass approach. As with many interview questions, several characteristics of the situation were left unspecified; the idea is not so much to see whether the applicant gives the one absolutely correct answer (though some answers can be quite wrong!) as to gain insight into his or her thought process and ability to identify options and decision points.