A HashMap will use at least as much memory as your original data, so it is probably not feasible at the size of your dataset (though you should check this, because if it does fit, it is the easiest option).
What I would do is write the data to a file or database, compute a hash for the deduplicated fields, and keep only the hash values in memory together with a small reference back to the file (for example, the byte offset where the original value starts in the written file). The reference should be as small as possible.
When you hit a hash match, read the original value back and check whether it is really identical (since hashes of different values can collide). See the sketch below.
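Here is a minimal Java sketch of that idea, under my own assumptions about the data (values are strings, `String.hashCode()` is good enough as the cheap hash): the in-memory map only stores hash-to-offset entries, the full values live in an append-only flat file, and a collision is resolved by seeking to the stored offset and comparing the real value.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Keeps only hash -> byte-offset(s) in memory; full values live in a flat file. */
public class FlatFileDeduplicator implements AutoCloseable {
    private final RandomAccessFile store;                       // append-only value store
    private final Map<Integer, List<Long>> hashToOffsets = new HashMap<>();

    public FlatFileDeduplicator(String path) throws IOException {
        this.store = new RandomAccessFile(path, "rw");
    }

    /** Returns true if the value was new, false if it was a duplicate. */
    public boolean add(String value) throws IOException {
        int hash = value.hashCode();
        List<Long> offsets = hashToOffsets.computeIfAbsent(hash, h -> new ArrayList<>(1));
        for (long offset : offsets) {
            if (value.equals(readAt(offset))) {
                return false;                                   // same hash, same value: duplicate
            }
        }
        // Unseen value (or a hash collision with a different value): append it to the file.
        long newOffset = store.length();
        store.seek(newOffset);
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        store.writeInt(bytes.length);                           // length prefix, then payload
        store.write(bytes);
        offsets.add(newOffset);                                 // remember where this value starts
        return true;
    }

    private String readAt(long offset) throws IOException {
        store.seek(offset);
        byte[] bytes = new byte[store.readInt()];
        store.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        store.close();
    }
}
```

In a real implementation you would shrink the in-memory footprint further, for instance with a primitive long-to-long map instead of boxed `Integer`/`Long` keys, but the structure stays the same.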
The question is how many duplicates you expect. If you expect very few matches, I would choose a solution that is cheap to write and expensive to read, i.e. dump everything linearly into a flat file and only read it back on a hash match, as in the sketch above.
If you expect a lot of matches, it is probably the other way around: use an indexed file or a set of files, or even a database (just make sure it is one where writes are not too expensive). A sketch of that variant follows below.
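For the database variant, here is a hedged sketch using plain JDBC. The table name `seen`, the columns, and the `CREATE ... IF NOT EXISTS` DDL are my assumptions (they work with embedded databases such as H2 or SQLite, with the appropriate driver on the classpath); the point is simply that the database's index on the hash column replaces the in-memory map, so frequent lookups stay cheap.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Dedup backed by an indexed table: the database index replaces the in-memory hash map. */
public class DbDeduplicator implements AutoCloseable {
    private final Connection conn;
    private final PreparedStatement lookup;
    private final PreparedStatement insert;

    public DbDeduplicator(String jdbcUrl) throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl);        // e.g. "jdbc:h2:./dedup" (assumption)
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS seen (h BIGINT NOT NULL, v VARCHAR NOT NULL)");
            stmt.execute("CREATE INDEX IF NOT EXISTS idx_seen_h ON seen(h)");
        }
        lookup = conn.prepareStatement("SELECT v FROM seen WHERE h = ?");
        insert = conn.prepareStatement("INSERT INTO seen (h, v) VALUES (?, ?)");
    }

    /** Returns true if the value was new, false if it was already stored. */
    public boolean add(String value) throws SQLException {
        long hash = value.hashCode();                        // cheap hash, widened to BIGINT
        lookup.setLong(1, hash);
        try (ResultSet rs = lookup.executeQuery()) {
            while (rs.next()) {
                if (value.equals(rs.getString(1))) {
                    return false;                            // same hash and same value: duplicate
                }
            }
        }
        insert.setLong(1, hash);                             // new value (or a harmless collision)
        insert.setString(2, value);
        insert.executeUpdate();
        return true;
    }

    @Override
    public void close() throws SQLException {
        lookup.close();
        insert.close();
        conn.close();
    }
}
```

If your database supports a unique constraint on the value itself cheaply enough, you can let the constraint reject duplicates instead of doing the lookup-then-insert dance, but with large values indexing just the hash is usually the cheaper choice.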