Duplicate removal in Java on large-scale data

I have the following problem. I connect to an external service through its API and receive the data as input. The goal is to save the data after removing duplicate rows, where a duplicate is defined by the values in columns 10, 15 and 22.

I fetch the data using multiple threads. Currently I first save everything to a CSV file and then remove the duplicates, but I would like to deduplicate while reading the data instead. The volume is about 10 million records. The machine has 32 GB of memory, but I can only use a limited portion of it, since other applications run on it as well.

I have read about using hash maps for this, but I'm not sure I have enough memory for one.

Does anyone have a suggestion for solving this problem?

+6
3 answers

A HashMap will use at least as much memory as your original data, so it is probably not feasible for a dataset of your size (you should verify this, though, because if it does fit, it is the easiest option).

What I would do is write the data to a file or database, compute a hash over the deduplication fields, and keep only the hash values in memory together with a suitable reference into the file (for example, the byte offset at which the original record was written). The reference should, of course, be as small as possible.

When you hit a hash match, look up the original value and check whether it is really identical, since hashes of different values can collide.

The question is how many duplicates you expect. If you expect hardly any matches, I would choose a solution that is cheap to write and expensive to read, i.e. dump everything linearly into a flat file and read records back from that file only when a hash matches.

If you expect many matches, it is probably the other way around, i.e. an indexed file or set of files, or even a database (make sure it is a database where writes are not too expensive).
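
A minimal sketch of this idea in Java, assuming each record has already been split into a `String[]` of columns and that indices 10, 15 and 22 refer to positions in that array; the class and method names are illustrative, not from the answer:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.*;

/**
 * Writes records to a flat file while deduplicating by columns 10, 15 and 22.
 * Only a 64-bit hash per distinct key plus the byte offsets of previously
 * written records are kept in memory; on a hash match the original record is
 * re-read from the file to rule out hash collisions.
 */
public class HashOffsetDeduplicator implements AutoCloseable {

    private final RandomAccessFile out;
    // hash of the key columns -> offsets of records already written with that hash
    private final Map<Long, List<Long>> seen = new HashMap<>();

    public HashOffsetDeduplicator(String path) throws IOException {
        this.out = new RandomAccessFile(path, "rw");
    }

    /** Returns true if the record was written, false if it was a duplicate. */
    public synchronized boolean writeIfNew(String[] columns) throws IOException {
        String key = columns[10] + "\u0001" + columns[15] + "\u0001" + columns[22];
        long hash = fnv1a64(key);

        List<Long> offsets = seen.get(hash);
        if (offsets != null) {
            for (long offset : offsets) {
                if (key.equals(readKeyAt(offset))) {
                    return false;               // real duplicate, not just a hash collision
                }
            }
        }
        long offset = out.length();
        out.seek(offset);
        // Naive CSV output (no quoting of embedded commas) to keep the sketch short.
        out.write((String.join(",", columns) + "\n").getBytes(StandardCharsets.UTF_8));
        seen.computeIfAbsent(hash, h -> new ArrayList<>(1)).add(offset);
        return true;
    }

    /** Re-reads the record written at the given offset and rebuilds its key. */
    private String readKeyAt(long offset) throws IOException {
        out.seek(offset);
        // NOTE: RandomAccessFile.readLine() does not decode UTF-8; this sketch
        // assumes the key columns contain only ASCII characters.
        String[] columns = out.readLine().split(",", -1);
        return columns[10] + "\u0001" + columns[15] + "\u0001" + columns[22];
    }

    /** Simple FNV-1a 64-bit hash; any reasonable 64-bit hash would do. */
    private static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= 0x100000001b3L;
        }
        return h;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

Note that a java.util.HashMap with boxed Long keys still carries noticeable per-entry overhead (on the order of 100 bytes or more per distinct key), so for 10 million keys a primitive-specialized map from a library such as fastutil or Eclipse Collections would be considerably more compact.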

+1

The solution depends on how large the values in columns 10, 15 and 22 are.

Assuming they are not too large (say, about 1 kilobyte per key), you can actually implement the solution in memory.

  • Introduce a Key class that stores the values from columns 10, 15 and 22, and implement its equals and hashCode methods carefully. (Alternatively, you can simply use a plain ArrayList holding the three values, since it already implements equals and hashCode element-wise.)
  • Create a Set that will contain the keys of all records read so far.
  • For each record you read, check whether its key is already in the set. If it is, skip the record. If it is not, write the record to the output and add the key to the set. Make sure you access the set in a thread-safe way (see the sketch after this list).
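
A sketch of that approach, assuming each record arrives as a `String[]` of columns; the Key class and method names are illustrative:

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Key over columns 10, 15 and 22; equals/hashCode compare all three values. */
final class Key {
    private final String c10, c15, c22;

    Key(String[] columns) {
        this.c10 = columns[10];
        this.c15 = columns[15];
        this.c22 = columns[22];
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Key)) return false;
        Key k = (Key) o;
        return c10.equals(k.c10) && c15.equals(k.c15) && c22.equals(k.c22);
    }

    @Override
    public int hashCode() {
        return Objects.hash(c10, c15, c22);
    }
}

class InMemoryDeduplicator {
    // Thread-safe set, so multiple reader threads can share one instance.
    private final Set<Key> seen = ConcurrentHashMap.newKeySet();

    /** Returns true exactly once per distinct key; duplicates return false. */
    boolean isNew(String[] columns) {
        return seen.add(new Key(columns));
    }
}
```

Set.add returns false when the element is already present, so the check-and-insert is a single operation on the concurrent set, which is what makes it safe to call from several reader threads without extra locking.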

In the worst case you will need (number of records) × (key size) of memory. For 10,000,000 records and an estimated key size of under 1 KB, that comes to roughly 10 GB.

If the key size is still too large, you will probably need a database to store the set of keys.

Another option is to store hashes of the keys instead of the full keys. This requires much less memory, but you may encounter hash collisions, which can lead to "false positives", i.e. records reported as duplicates that are not actually duplicates. To avoid this completely, you would need a database.

+1

You can use a concurrent hash set: it automatically rejects duplicate elements and handles access from multiple threads.
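
The JDK has no class literally named ConcurrentHashSet; the usual way to get a concurrent hash set is the set view returned by ConcurrentHashMap.newKeySet() (Java 8+) or Collections.newSetFromMap(new ConcurrentHashMap<>()). A minimal sketch, with illustrative names:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentDedup {
    // Concurrent set view backed by a ConcurrentHashMap (Java 8+).
    private static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

    /**
     * add() returns false if the key is already present, so duplicates are
     * skipped safely even when several threads call this at the same time.
     */
    public static boolean isNew(String[] columns) {
        return SEEN.add(columns[10] + "|" + columns[15] + "|" + columns[22]);
    }
}
```

Storing the full key (here, a concatenation of the three columns) avoids false positives; storing only a hash would need the extra verification step described in the first answer.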

0
