How to extract unique lines from a file > 10 GB with 4 GB of RAM

I have a PC with 4 GB of RAM and a 10 GB file. I want to keep only the unique lines of the file, so I wrote the following code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class Cleaner {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Too few parameters!");
            return;
        }
        File file = new File(args[0]);
        BufferedReader buff = new BufferedReader(new FileReader(file));
        String line;
        Set<String> set = new HashSet<String>();
        while ((line = buff.readLine()) != null) {
            set.add(line);
        }
        FileWriter fw = new FileWriter(args[1]);
        for (String s : set) {
            fw.write(s + "\n");
            fw.flush();
        }
        fw.close();
        buff.close();
    }
}

But I get an OutOfMemoryError, so my question is:
How do I change my code to get a file where each line is unique?
Thank you for your help.

3 answers

First, you could hash each line and store only the hashes, to identify potential duplicate lines:

Map<Integer, Integer> hashes = new HashMap<>();
Map<Integer, Integer> dupes = new HashMap<>();
int i = 0;
while ((line = buff.readLine()) != null) {
    int hash = line.hashCode();
    Integer previous = hashes.get(hash);
    if (previous != null) {
        // potential duplicate: another line with the same hash was seen earlier
        dupes.put(i, previous);
    } else {
        hashes.put(hash, i);
    }
    ++i;
}

At the end, you have a map of potential duplicates. If dupes is empty, there were no duplicates; if it is not, you can make a second pass over the file to check whether the candidate lines are really identical.
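For example, the second pass could look roughly like this (just a sketch; it assumes the dupes map and the file variable from the snippets above, and realDuplicates ends up holding the indices of lines that really repeat an earlier line):

// Collect the indices of all lines involved in a potential duplicate.
Set<Integer> needed = new HashSet<>();
for (Map.Entry<Integer, Integer> e : dupes.entrySet()) {
    needed.add(e.getKey());
    needed.add(e.getValue());
}

// Re-read the file, keeping in memory only the candidate lines.
Map<Integer, String> candidates = new HashMap<>();
BufferedReader second = new BufferedReader(new FileReader(file));
String l;
int j = 0;
while ((l = second.readLine()) != null) {
    if (needed.contains(j)) {
        candidates.put(j, l);
    }
    ++j;
}
second.close();

// A line is a real duplicate only if it equals the earlier line with the same hash.
Set<Integer> realDuplicates = new HashSet<>();
for (Map.Entry<Integer, Integer> e : dupes.entrySet()) {
    if (candidates.get(e.getKey()).equals(candidates.get(e.getValue()))) {
        realDuplicates.add(e.getKey());
    }
}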


You cannot do this operation in one pass with that little RAM. Instead, you can read the file and split it into n smaller files of a fixed size (e.g. 10,000 lines each): read a line, write it to the current chunk file, and when the chunk reaches its limit, open a new one and free all objects to save memory. Then run a second loop and compare each line of the source file, line by line, against the n generated files. This way you can probably avoid running out of memory.

It is a bit awkward and will be a slow process, but I think you can meet your requirements this way.
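For illustration, here is a rough sketch of a related external approach in Java: instead of fixed-size chunks, it partitions lines into temporary bucket files by hash, so each bucket is small enough to deduplicate in memory. The file names (input.txt, output.txt, bucket-N.txt) are just placeholders:

import java.io.*;
import java.util.*;

public class ExternalDedup {
    public static void main(String[] args) throws IOException {
        int buckets = 100;                       // tune so each bucket fits in RAM
        PrintWriter[] writers = new PrintWriter[buckets];
        for (int i = 0; i < buckets; i++) {
            writers[i] = new PrintWriter(new FileWriter("bucket-" + i + ".txt"));
        }

        // Pass 1: scatter lines into buckets; identical lines always land in the same bucket.
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int b = Math.floorMod(line.hashCode(), buckets);
                writers[b].println(line);
            }
        }
        for (PrintWriter w : writers) {
            w.close();
        }

        // Pass 2: deduplicate each bucket in memory and append the unique lines to the output.
        try (PrintWriter out = new PrintWriter(new FileWriter("output.txt"))) {
            for (int i = 0; i < buckets; i++) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader in = new BufferedReader(new FileReader("bucket-" + i + ".txt"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (seen.add(line)) {
                            out.println(line);
                        }
                    }
                }
            }
        }
    }
}

With 100 buckets, each bucket of a 10 GB file is roughly 100 MB on average (assuming the hashes spread lines fairly evenly), which fits comfortably in 4 GB of RAM.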

If you need the code, let me know.

Hope this helps.


You could do something like this (a Groovy example, but equivalent Java will work):

import org.apache.commons.codec.digest.DigestUtils   // Apache Commons Codec

def hashes = []
def writer = new PrintWriter(new FileWriter("out.txt"))
new File('test.txt').eachLine { line ->
    def hashCode = DigestUtils.sha256Hex(line)        // Commons digest library
    if (!(hashCode in hashes)) {
        hashes << hashCode
        writer.println(line)
    }
}
writer.close()

This does not require more than 1 GB of RAM. SHA-256 hashes will probably give you much more confidence in the uniqueness of a line than the standard hashCode method.
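For reference, a rough Java version of the same idea (it assumes Apache Commons Codec is on the classpath and uses a HashSet of hashes instead of a list so lookups stay fast; the file names are placeholders):

import java.io.*;
import java.util.*;
import org.apache.commons.codec.digest.DigestUtils;

public class HashDedup {
    public static void main(String[] args) throws IOException {
        Set<String> hashes = new HashSet<>();    // one 64-char hex string per distinct line
        try (BufferedReader in = new BufferedReader(new FileReader("test.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("out.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String hash = DigestUtils.sha256Hex(line);
                if (hashes.add(hash)) {          // add() returns false if the hash was already seen
                    out.println(line);
                }
            }
        }
    }
}

Note that storing the hex hashes still costs memory proportional to the number of distinct lines, so whether this fits in 4 GB depends on how many distinct lines the file actually contains.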



