I need to compute the set difference of two files in Java. The files have about 50 million lines each, so I cannot load them fully into memory. I could follow the steps listed below, but I plan on using the Linux comm command, which does this efficiently.
- Is there a Java library that does this efficiently?
- Is it bad design to call shell commands from a program?
More details
I have file1 and file2, each of which has over 40 million lines. I do not want to load them into memory. I need to find the set difference file1 - file2, that is, the lines that are in file1 but not in file2. In general, I would follow this algorithm:
1. Read file1 line by line and add each line to a HashSet.
2. Read file2 line by line.
3. Remove each line of file2 from the HashSet if present.
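For reference, the three steps above can be sketched roughly as follows. This is only a minimal illustration: the class name `HashSetDiff` and the method `difference` are made up for this example, and UTF-8 input is assumed. It keeps every distinct line of file1 in memory, which is exactly the cost I would like to avoid.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, named only for this sketch.
class HashSetDiff {
    // Returns the distinct lines that occur in file1 but not in file2.
    // All distinct lines of file1 are held in memory at once.
    static Set<String> difference(Path file1, Path file2) throws IOException {
        Set<String> remaining = new HashSet<>();

        // Step 1: read file1 line by line into the set.
        try (BufferedReader in = Files.newBufferedReader(file1, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                remaining.add(line);
            }
        }

        // Steps 2 and 3: stream file2 and remove each of its lines from the set.
        try (BufferedReader in = Files.newBufferedReader(file2, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                remaining.remove(line);
            }
        }
        return remaining;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HashSetDiff file1 file2
        for (String line : difference(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(line);
        }
    }
}
```

Note that the result is a set, so duplicate lines in file1 are reported once and the original line order is not preserved.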
Is there a way to do this without storing file1 in a HashSet?
Edit: My solution
I ended up using a Bloom filter with a false-positive probability of 10^-9:
1. Read each line of file2 and add it to a Bloom filter.
2. file2 is now represented by the filter, which takes 40MB+ instead of 300MB+.
3. Read each line of file1; if it is not present in the filter, print the line.
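A rough, self-contained sketch of these steps is below. It is not the exact code I ran: the `BloomDiff` class, its nested `BloomFilter`, and the hashing scheme (FNV-1a plus a mixing step, combined by double hashing) are assumptions made for illustration, and a library implementation would normally be preferable. The filter is sized with the standard formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2. The false-positive rate actually achieved depends on the quality of the hash functions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical class, named only for this sketch.
class BloomDiff {
    /** Minimal Bloom filter over strings, for illustration only. */
    static final class BloomFilter {
        private final long[] words;   // bit array stored as 64-bit words
        private final long numBits;   // m
        private final int numHashes;  // k

        BloomFilter(long expectedItems, double falsePositiveRate) {
            double ln2 = Math.log(2);
            long n = Math.max(1, expectedItems);
            // m = -n ln p / (ln 2)^2, k = (m / n) ln 2
            numBits = Math.max(64, (long) Math.ceil(-n * Math.log(falsePositiveRate) / (ln2 * ln2)));
            numHashes = Math.max(1, (int) Math.round((double) numBits / n * ln2));
            words = new long[(int) ((numBits + 63) / 64)];
        }

        void put(String s) {
            long h1 = mix(fnv1a(s));
            long h2 = mix(h1 ^ 0x9E3779B97F4A7C15L) | 1L;
            for (int i = 0; i < numHashes; i++) {
                long bit = Long.remainderUnsigned(h1 + i * h2, numBits);
                words[(int) (bit >>> 6)] |= 1L << (int) (bit & 63);
            }
        }

        // May return true for a string that was never added (false positive),
        // but never returns false for a string that was added.
        boolean mightContain(String s) {
            long h1 = mix(fnv1a(s));
            long h2 = mix(h1 ^ 0x9E3779B97F4A7C15L) | 1L;
            for (int i = 0; i < numHashes; i++) {
                long bit = Long.remainderUnsigned(h1 + i * h2, numBits);
                if ((words[(int) (bit >>> 6)] & (1L << (int) (bit & 63))) == 0) {
                    return false;
                }
            }
            return true;
        }

        // 64-bit FNV-1a over the UTF-8 bytes of the string.
        private static long fnv1a(String s) {
            long h = 0xcbf29ce484222325L;
            for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                h ^= (b & 0xff);
                h *= 0x100000001b3L;
            }
            return h;
        }

        // Bit-mixing finalizer (the one used by splitmix64) to spread the hash bits.
        private static long mix(long z) {
            z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
            z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
            return z ^ (z >>> 31);
        }
    }

    static void printDifference(Path file1, Path file2, long expectedLines,
                                double falsePositiveRate, PrintStream out) throws IOException {
        BloomFilter filter = new BloomFilter(expectedLines, falsePositiveRate);
        // Step 1: add every line of file2 to the filter.
        try (BufferedReader in = Files.newBufferedReader(file2, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                filter.put(line);
            }
        }
        // Step 3: print the lines of file1 that the filter has definitely not seen.
        try (BufferedReader in = Files.newBufferedReader(file1, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!filter.mightContain(line)) {
                    out.println(line);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java BloomDiff file1 file2 expectedLinesInFile2
        printDifference(Paths.get(args[0]), Paths.get(args[1]),
                Long.parseLong(args[2]), 1e-9, System.out);
    }
}
```

The trade-off is that a false positive makes the filter claim a file1 line is in file2 when it is not, so that line is silently left out of the output. A Bloom filter has no false negatives, so every printed line is genuinely absent from file2. Unlike the HashSet version, this keeps the order of file1 and prints duplicate lines as often as they occur.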