Java efficient duplicate removal

Let's say you have a large text file. Each line contains an email identifier and some other information (for example, a product identifier). Suppose the file has millions of lines. You must load this data into a database. How would you efficiently de-duplicate the data (i.e., eliminate duplicates)?

+3
6 answers

An insane number of rows

  • Use a Map & Reduce framework (e.g. Hadoop). This is a full-blown distributed computing system, so it is overkill unless you have TBs of data. (j/k :))
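
If you do go the Hadoop route, the job is tiny: emit each line as a key, let the shuffle phase group the duplicates, and have the reducer write each key once. Here is a minimal sketch against the Hadoop MapReduce API (the class names and the whole-line-as-key choice are my own, not from the answer):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Each mapper emits the whole line as the key; the shuffle groups
    // identical lines, and the reducer writes each distinct line once.
    public class DedupJob {

        public static class LineMapper
                extends Mapper<Object, Text, Text, NullWritable> {
            @Override
            protected void map(Object offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(line, NullWritable.get());
            }
        }

        public static class FirstOnlyReducer
                extends Reducer<Text, NullWritable, Text, NullWritable> {
            @Override
            protected void reduce(Text line, Iterable<NullWritable> ignored, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(line, NullWritable.get()); // one copy per distinct line
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "dedup");
            job.setJarByClass(DedupJob.class);
            job.setMapperClass(LineMapper.class);
            job.setReducerClass(FirstOnlyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }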

All rows won't fit in memory

  • Even the de-duplicated result won't fit: use a merge sort, persisting intermediate data to disk; while merging, you can discard the duplicates (a sketch follows this list). This can be multi-threaded if you want.
  • The de-duplicated result will fit: instead of reading everything into memory first, use a line iterator or similar and keep adding keys to a HashSet as you read (a combined sketch appears under the next heading). You can use a ConcurrentHashMap and more than one thread to read the file and add to it. Another multi-threaded option is a ConcurrentSkipListSet; in that case you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 meaning duplicate) and keep adding to this SortedSet.
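
A minimal sketch of the merge-sort variant, assuming the whole line is the duplicate key and that the chunk size is tuned to the available heap (all names are illustrative; requires Java 16+ for the local record):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ExternalSortDedup {
        static final int CHUNK_LINES = 1_000_000; // tune to available heap

        public static void main(String[] args) throws IOException {
            List<Path> runs = splitIntoSortedRuns(Path.of(args[0]));
            mergeDiscardingDuplicates(runs, Path.of(args[1]));
        }

        // Phase 1: read the input in chunks that fit in memory,
        // sort each chunk and spill it to a temporary "run" file.
        static List<Path> splitIntoSortedRuns(Path input) throws IOException {
            List<Path> runs = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<String> chunk = new ArrayList<>(CHUNK_LINES);
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == CHUNK_LINES) runs.add(spill(chunk));
                }
                if (!chunk.isEmpty()) runs.add(spill(chunk));
            }
            return runs;
        }

        static Path spill(List<String> chunk) throws IOException {
            Collections.sort(chunk);
            Path run = Files.createTempFile("run", ".txt");
            Files.write(run, chunk);
            chunk.clear();
            return run;
        }

        // Phase 2: k-way merge of the sorted runs; equal lines arrive as
        // neighbours, so a line is emitted only if it differs from the last one.
        static void mergeDiscardingDuplicates(List<Path> runs, Path output) throws IOException {
            record Head(String line, BufferedReader reader) {}
            PriorityQueue<Head> heap =
                new PriorityQueue<>((a, b) -> a.line().compareTo(b.line()));
            for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run);
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                String last = null;
                while (!heap.isEmpty()) {
                    Head h = heap.poll();
                    if (!h.line().equals(last)) { // skip duplicates
                        out.write(h.line());
                        out.newLine();
                        last = h.line();
                    }
                    String next = h.reader().readLine();
                    if (next != null) heap.add(new Head(next, h.reader()));
                    else h.reader().close();
                }
            }
        }
    }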

Everything fits in memory

  • Design an object that holds your data, implement a good equals()/hashCode() for it, and put all the objects into a HashSet (a sketch follows this list).
  • Or use any of the methods given above (you probably don't want to persist to disk, though).
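
A minimal sketch combining the two ideas: stream the file with a line iterator and collect keys in a HashSet, with a record supplying the equals()/hashCode() pair. Tab-separated lines with the email first and the product id second are my assumption, not the answer's:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.Set;

    public class HashSetDedup {

        // A record (Java 16+) generates a value-based equals()/hashCode(),
        // which is exactly what HashSet needs to detect duplicates.
        record Key(String email, String productId) {}

        public static void main(String[] args) throws IOException {
            Set<Key> seen = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(Path.of(args[0]));
                 BufferedWriter out = Files.newBufferedWriter(Path.of(args[1]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t", 3); // assumed tab-separated
                    Key key = new Key(f[0], f.length > 1 ? f[1] : "");
                    if (seen.add(key)) {              // add() returns false for duplicates
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }
    }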

Oh, and if I were you, I would put a unique constraint on the database anyway...

+7

The data ends up in a database either way (that is the stated goal), so let the database do the work: bulk-load the rows into a staging table as-is and de-duplicate there. SQL is a better tool for this than a HashMap in application code. And, as eqbridges says, we are talking about a "gazillion" rows.
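
Sketching that idea with plain JDBC (the connection string, driver and table names are placeholders; the bulk-load step itself depends on your database):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class StagingDedup {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/db", "user", "pass");
                 Statement s = c.createStatement()) {
                // Step 1 (not shown): bulk-load the raw file into a staging
                // table, e.g. with COPY, LOAD DATA INFILE or a driver bulk API.
                // Step 2: let the database de-duplicate on the way to the target.
                s.executeUpdate(
                    "INSERT INTO target (email, product_id) " +
                    "SELECT DISTINCT email, product_id FROM staging");
            }
        }
    }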

+1

Two options:

  • In Java: run the file through something like a HashSet; if a line's key is not yet in the set, keep the line, otherwise skip it (a thread-safe variant is sketched after this list).
  • In the database: declare a unique constraint and let the database reject the duplicates on insert; it then does the de-duplication for you.
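
A thread-safe variant of the HashSet option, sketched with ConcurrentHashMap.newKeySet() and a parallel stream (tab-separated input with the email in the first field is my assumption):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Stream;

    public class ParallelDedup {
        public static void main(String[] args) throws IOException {
            // ConcurrentHashMap.newKeySet() behaves like a thread-safe HashSet.
            Set<String> seen = ConcurrentHashMap.newKeySet();
            try (Stream<String> lines = Files.lines(Path.of(args[0]))) {
                lines.parallel()
                     // add() is atomic: exactly one thread wins for each key,
                     // though which duplicate survives is nondeterministic.
                     .filter(line -> seen.add(line.split("\t", 2)[0]))
                     .forEach(System.out::println);
            }
        }
    }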

+1

Duke (https://github.com/larsga/Duke) is a fast de-duplication and record-linkage engine written in Java. It uses Lucene for indexing (which cuts down the number of pairwise comparisons). It supports a range of similarity comparators (edit distance, Jaro-Winkler, etc.) and is highly configurable.

+1

Can't you just put a unique key on email + prodid? Then the database rejects the duplicate rows at insert time.
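
A sketch of that approach with JDBC: create the composite unique key once, then swallow the duplicate-key exception on insert (connection string and table names are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLIntegrityConstraintViolationException;
    import java.sql.Statement;

    public class UniqueKeyDedup {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://localhost/db", "user", "pass")) {
                try (Statement s = c.createStatement()) {
                    // One-time setup: the composite unique key does the de-duping.
                    s.executeUpdate("ALTER TABLE target "
                        + "ADD CONSTRAINT uq_email_prod UNIQUE (email, product_id)");
                }
                try (PreparedStatement ps = c.prepareStatement(
                         "INSERT INTO target (email, product_id) VALUES (?, ?)")) {
                    insertIgnoringDuplicates(ps, "a@example.com", "p1");
                    insertIgnoringDuplicates(ps, "a@example.com", "p1"); // silently dropped
                }
            }
        }

        static void insertIgnoringDuplicates(PreparedStatement ps,
                                             String email, String prodId) throws Exception {
            try {
                ps.setString(1, email);
                ps.setString(2, prodId);
                ps.executeUpdate();
            } catch (SQLIntegrityConstraintViolationException duplicate) {
                // The unique key rejected the row: exactly what we want.
            }
        }
    }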

0

Your problem can be solved with an "Extract, Transform, Load" (ETL) approach:

  • Load your data into an import schema;
  • Apply any transformations you like to the data;
  • Then load it into the target database schema.

You can do this manually or use an ETL tool.
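
A minimal hand-rolled version of those three steps (the file layout, connection details and the lower-casing transform are all illustrative):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class MiniEtl {
        public static void main(String[] args) throws Exception {
            // Extract + Transform: normalise each row and de-duplicate;
            // a LinkedHashSet keeps the first occurrence of each row.
            // (readAllLines is fine only while the file fits in memory.)
            Set<String> rows = new LinkedHashSet<>();
            for (String line : Files.readAllLines(Path.of(args[0]))) {
                rows.add(line.trim().toLowerCase()); // example transformation
            }
            // Load: batch-insert into the target schema.
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/db", "user", "pass");
                 PreparedStatement ps = c.prepareStatement(
                     "INSERT INTO target (email, product_id) VALUES (?, ?)")) {
                for (String row : rows) {
                    String[] f = row.split("\t", 2); // assumed tab-separated
                    ps.setString(1, f[0]);
                    ps.setString(2, f.length > 1 ? f[1] : null);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }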

0

Source: https://habr.com/ru/post/1734367/

