Java efficient duplicate removal

Let's say you have a large text file. Each line contains an email identifier and some other information (for example, a product identifier). Suppose the file has millions of lines. You must load this data into a database. How would you efficiently de-duplicate the data (i.e., eliminate duplicates)?

+3
6 answers

An insane number of rows

  • Use a Map & Reduce framework (e.g. Hadoop). This is a full-blown distributed computing system, so it is overkill unless you have TBs of data. (j/k :))
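
If you do go the Hadoop route, the job is tiny: emit each line as a key, let the shuffle phase group the duplicates, and have the reducer write each key once. Here is a minimal sketch against the Hadoop MapReduce API (the class names and the whole-line-as-key choice are my own, not from the answer):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Each mapper emits the whole line as the key; the shuffle groups
    // identical lines, and the reducer writes each distinct line once.
    public class DedupJob {

        public static class LineMapper
                extends Mapper<Object, Text, Text, NullWritable> {
            @Override
            protected void map(Object offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(line, NullWritable.get());
            }
        }

        public static class FirstOnlyReducer
                extends Reducer<Text, NullWritable, Text, NullWritable> {
            @Override
            protected void reduce(Text line, Iterable<NullWritable> ignored, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(line, NullWritable.get()); // one copy per distinct line
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "dedup");
            job.setJarByClass(DedupJob.class);
            job.setMapperClass(LineMapper.class);
            job.setReducerClass(FirstOnlyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }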

All rows won't fit in memory

  • Even the de-duplicated result won't fit: use a merge sort, persisting intermediate data to disk; while merging, you can discard the duplicates (a sketch follows this list). This can be multi-threaded if you want.
  • The de-duplicated result will fit: instead of reading everything into memory first, use a line iterator or similar and keep adding keys to a HashSet as you read (a combined sketch appears under the next heading). You can use a ConcurrentHashMap and more than one thread to read the file and add to it. Another multi-threaded option is a ConcurrentSkipListSet; in that case you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 meaning duplicate) and keep adding to this SortedSet.
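
A minimal sketch of the merge-sort variant, assuming the whole line is the duplicate key and that the chunk size is tuned to the available heap (all names are illustrative; requires Java 16+ for the local record):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ExternalSortDedup {
        static final int CHUNK_LINES = 1_000_000; // tune to available heap

        public static void main(String[] args) throws IOException {
            List<Path> runs = splitIntoSortedRuns(Path.of(args[0]));
            mergeDiscardingDuplicates(runs, Path.of(args[1]));
        }

        // Phase 1: read the input in chunks that fit in memory,
        // sort each chunk and spill it to a temporary "run" file.
        static List<Path> splitIntoSortedRuns(Path input) throws IOException {
            List<Path> runs = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<String> chunk = new ArrayList<>(CHUNK_LINES);
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == CHUNK_LINES) runs.add(spill(chunk));
                }
                if (!chunk.isEmpty()) runs.add(spill(chunk));
            }
            return runs;
        }

        static Path spill(List<String> chunk) throws IOException {
            Collections.sort(chunk);
            Path run = Files.createTempFile("run", ".txt");
            Files.write(run, chunk);
            chunk.clear();
            return run;
        }

        // Phase 2: k-way merge of the sorted runs; equal lines arrive as
        // neighbours, so a line is emitted only if it differs from the last one.
        static void mergeDiscardingDuplicates(List<Path> runs, Path output) throws IOException {
            record Head(String line, BufferedReader reader) {}
            PriorityQueue<Head> heap =
                new PriorityQueue<>((a, b) -> a.line().compareTo(b.line()));
            for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run);
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                String last = null;
                while (!heap.isEmpty()) {
                    Head h = heap.poll();
                    if (!h.line().equals(last)) { // skip duplicates
                        out.write(h.line());
                        out.newLine();
                        last = h.line();
                    }
                    String next = h.reader().readLine();
                    if (next != null) heap.add(new Head(next, h.reader()));
                    else h.reader().close();
                }
            }
        }
    }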

Everything fits in memory

  • Design an object that holds your data, implement a good equals()/hashCode() for it, and put all the objects into a HashSet (a sketch follows this list).
  • Or use any of the methods given above (you probably don't want to persist to disk, though).
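
A minimal sketch combining the two ideas: stream the file with a line iterator and collect keys in a HashSet, with a record supplying the equals()/hashCode() pair. Tab-separated lines with the email first and the product id second are my assumption, not the answer's:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.Set;

    public class HashSetDedup {

        // A record (Java 16+) generates a value-based equals()/hashCode(),
        // which is exactly what HashSet needs to detect duplicates.
        record Key(String email, String productId) {}

        public static void main(String[] args) throws IOException {
            Set<Key> seen = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(Path.of(args[0]));
                 BufferedWriter out = Files.newBufferedWriter(Path.of(args[1]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t", 3); // assumed tab-separated
                    Key key = new Key(f[0], f.length > 1 ? f[1] : "");
                    if (seen.add(key)) {              // add() returns false for duplicates
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }
    }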

Oh, and if I were you, I would put a unique constraint on the database anyway...

+7

The data ends up in a database either way (that is the stated goal), so let the database do the work: bulk-load the rows into a staging table as-is and de-duplicate there. SQL is a better tool for this than a HashMap in application code. And, as eqbridges says, we are talking about a "gazillion" rows.
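
Sketching that idea with plain JDBC (the connection string, driver and table names are placeholders; the bulk-load step itself depends on your database):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class StagingDedup {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/db", "user", "pass");
                 Statement s = c.createStatement()) {
                // Step 1 (not shown): bulk-load the raw file into a staging
                // table, e.g. with COPY, LOAD DATA INFILE or a driver bulk API.
                // Step 2: let the database de-duplicate on the way to the target.
                s.executeUpdate(
                    "INSERT INTO target (email, product_id) " +
                    "SELECT DISTINCT email, product_id FROM staging");
            }
        }
    }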

+1

Two options:

  • In Java: run the file through something like a HashSet; if a line's key is not yet in the set, keep the line, otherwise skip it (a thread-safe variant is sketched after this list).
  • In the database: declare a unique constraint and let the database reject the duplicates on insert; it then does the de-duplication for you.
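
A thread-safe variant of the HashSet option, sketched with ConcurrentHashMap.newKeySet() and a parallel stream (tab-separated input with the email in the first field is my assumption):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Stream;

    public class ParallelDedup {
        public static void main(String[] args) throws IOException {
            // ConcurrentHashMap.newKeySet() behaves like a thread-safe HashSet.
            Set<String> seen = ConcurrentHashMap.newKeySet();
            try (Stream<String> lines = Files.lines(Path.of(args[0]))) {
                lines.parallel()
                     // add() is atomic: exactly one thread wins for each key,
                     // though which duplicate survives is nondeterministic.
                     .filter(line -> seen.add(line.split("\t", 2)[0]))
                     .forEach(System.out::println);
            }
        }
    }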

+1

Duke (https://github.com/larsga/Duke) is a fast de-duplication and record-linkage engine written in Java. It uses Lucene for indexing (which cuts down the number of pairwise comparisons). It supports a range of similarity comparators (edit distance, Jaro-Winkler, etc.) and is highly configurable.

+1

Can't you just put a unique key on email + prodid? Then the database rejects the duplicate rows at insert time.
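
A sketch of that approach with JDBC: create the composite unique key once, then swallow the duplicate-key exception on insert (connection string and table names are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLIntegrityConstraintViolationException;
    import java.sql.Statement;

    public class UniqueKeyDedup {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://localhost/db", "user", "pass")) {
                try (Statement s = c.createStatement()) {
                    // One-time setup: the composite unique key does the de-duping.
                    s.executeUpdate("ALTER TABLE target "
                        + "ADD CONSTRAINT uq_email_prod UNIQUE (email, product_id)");
                }
                try (PreparedStatement ps = c.prepareStatement(
                         "INSERT INTO target (email, product_id) VALUES (?, ?)")) {
                    insertIgnoringDuplicates(ps, "a@example.com", "p1");
                    insertIgnoringDuplicates(ps, "a@example.com", "p1"); // silently dropped
                }
            }
        }

        static void insertIgnoringDuplicates(PreparedStatement ps,
                                             String email, String prodId) throws Exception {
            try {
                ps.setString(1, email);
                ps.setString(2, prodId);
                ps.executeUpdate();
            } catch (SQLIntegrityConstraintViolationException duplicate) {
                // The unique key rejected the row: exactly what we want.
            }
        }
    }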

0

Your problem can be solved with an "Extract, Transform, Load" (ETL) approach:

  • Load your data into an import schema;
  • Apply any transformations you like to the data;
  • Then load it into the target database schema.

You can do this manually or use an ETL tool.
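
A minimal hand-rolled version of those three steps (the file layout, connection details and the lower-casing transform are all illustrative):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class MiniEtl {
        public static void main(String[] args) throws Exception {
            // Extract + Transform: normalise each row and de-duplicate;
            // a LinkedHashSet keeps the first occurrence of each row.
            // (readAllLines is fine only while the file fits in memory.)
            Set<String> rows = new LinkedHashSet<>();
            for (String line : Files.readAllLines(Path.of(args[0]))) {
                rows.add(line.trim().toLowerCase()); // example transformation
            }
            // Load: batch-insert into the target schema.
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/db", "user", "pass");
                 PreparedStatement ps = c.prepareStatement(
                     "INSERT INTO target (email, product_id) VALUES (?, ?)")) {
                for (String row : rows) {
                    String[] f = row.split("\t", 2); // assumed tab-separated
                    ps.setString(1, f[0]);
                    ps.setString(2, f.length > 1 ? f[1] : null);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }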

0

Source: https://habr.com/ru/post/1734367/

