Imagine we have a file, say "A.txt", that we know contains some repeating elements. "A.txt" is very large: more than ten times larger than memory, about 50 GB. Sometimes the deduplicated result "B.txt" will be approximately the same size as "A.txt", and sometimes it will be many times smaller. Suppose "A.txt" has the following structure:
a 1
b 2
c 445
a 1
We need to produce a file "B.txt" that contains no such duplicates. For example, it should look like this:
a 1
b 2
c 445
I was thinking of an algorithm that copies A to B, then takes the first line of B and compares it against every other line, removing any duplicates it finds; then does the same for the second line, and so on.
But I think this method is too slow. What can I use instead?
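For concreteness, here is roughly what I had in mind, written out (a minimal sketch; `naive_dedup` is a made-up name):

```python
# A rough sketch of the idea above. Every line is checked against all
# lines written so far, so n lines cost O(n^2) comparisons; worse,
# `written` must hold every unique line in memory, which is impossible
# when B is close to the 50 GB size of A.
def naive_dedup(src: str, dst: str) -> None:
    with open(src) as fin, open(dst, "w") as out:
        written = []                 # every line emitted so far
        for line in fin:
            if line not in written:  # linear scan per input line
                written.append(line)
                out.write(line)
```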
A is not a database! No SQL, please.
Sorry, I forgot to say that sorting is OK.
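Since sorting is allowed, I assume something like an external merge sort that drops duplicates during the merge would work. A minimal sketch of that route (all names are mine, it assumes every line ends with a newline, and it does not preserve line order):

```python
import heapq
import itertools
import os
import tempfile

def external_sort_dedup(src: str, dst: str, chunk_lines: int = 1_000_000) -> None:
    # Phase 1: read memory-sized chunks, sort each, spill to a temp "run" file.
    runs = []
    with open(src) as fin:
        while True:
            chunk = list(itertools.islice(fin, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Phase 2: k-way merge of the sorted runs; duplicates are now adjacent,
    # so dropping them is one comparison against the previously written line.
    # (A real implementation would merge in multiple passes to cap the
    # number of simultaneously open run files.)
    with open(dst, "w") as out:
        prev = None
        for line in heapq.merge(*runs):
            if line != prev:
                out.write(line)
                prev = line
    for run in runs:
        run.close()
        os.remove(run.name)
```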
Although the file can be sorted here, what if it could not be?
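One sort-free idea I can think of (sketched below with made-up names): partition lines into bucket files by a hash of the line, so all copies of a duplicate land in the same bucket, then deduplicate each bucket independently with an in-memory set. With ~50 GB of input and 256 buckets, each bucket is ~200 MB, small enough to handle in RAM.

```python
import os

def dedup_by_partition(src: str, dst: str, n_buckets: int = 256) -> None:
    # Pass 1: scatter lines into bucket files by hash. Python's str hash is
    # randomized per process but stable within one run, which is all that
    # passes 1 and 2 need. 256 open files stays under typical fd limits.
    buckets = [open(f"bucket_{i}.tmp", "w") for i in range(n_buckets)]
    with open(src) as fin:
        for line in fin:
            buckets[hash(line) % n_buckets].write(line)
    for b in buckets:
        b.close()
    # Pass 2: dedup each bucket in memory and append survivors to the output.
    with open(dst, "w") as out:
        for i in range(n_buckets):
            path = f"bucket_{i}.tmp"
            seen = set()
            with open(path) as b:
                for line in b:
                    if line not in seen:
                        seen.add(line)
                        out.write(line)
            os.remove(path)
```

The caveat is that this only preserves line order within each bucket; if the original first-occurrence order matters, pass 1 could also record each line's position, and the survivors could be re-sorted by position at the end.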