Delete duplicate words in a large text file - Java

I have a text file larger than 50 GB and I want to remove the duplicate words from it. But I have heard that loading every word of the file into a HashSet takes a lot of RAM. Can you tell me a good way to remove every duplicate word from such a text file? The words are separated by spaces, like this:

word1 word2 word3 ... ... 
-3
3 answers

The H2 answer is good, but perhaps overkill. All the distinct words in English add up to no more than a few MB. Just use a Set. You could use this inside RAnders00's program:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {
        // Scanner streams the input token by token, so only the set of
        // distinct words is held in memory, not the whole file
        while (scanner.hasNext()) {
            String nextWord = scanner.next();
            words.add(nextWord);
        }
        System.out.println("words size " + words.size());
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

As an estimate of how many distinct words to expect, I tried this on War and Peace (from Project Gutenberg):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    // note: the method parameters are ignored in this test version
    try {
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());
        System.out.println("words size " + words.size()); // 22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    } catch (IOException e) {
        e.printStackTrace(); // don't swallow the exception silently
    }
}

It completed in 0 seconds. You cannot use Files.lines unless your huge source file actually contains line breaks. If it does, Files.lines processes the file lazily, line by line, so it will not use too much memory.
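If collecting the whole set before writing is a concern, here is a minimal sketch of the same idea that writes each word out the first time it is seen. Like Files.lines above, it assumes the input contains line breaks; the file names input.txt and out.txt are placeholders of mine, and the seen set still grows with the number of distinct words (but not with the file size):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class StreamDedup {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        try (Stream<String> lines = Files.lines(Paths.get("input.txt"));
             BufferedWriter writer = Files.newBufferedWriter(Paths.get("out.txt"))) {
            lines.flatMap(Pattern.compile("\\s+")::splitAsStream)
                 .filter(word -> !word.isEmpty() && seen.add(word)) // Set.add returns false for duplicates
                 .forEach(word -> {
                     try {
                         writer.write(word);
                         writer.write(' ');
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        } catch (IOException | UncheckedIOException e) {
            e.printStackTrace();
        }
    }
}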

+3

This approach uses a database to store the words found so far.

It also assumes that words are equal regardless of case.

The H2 documentation states that a database on a file system other than FAT has a maximum size of 4 TB (using the default page size of 2 KB), which is more than enough for this purpose.

package com.stackoverflow;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.*;
import java.util.Scanner;

public class H2WordReading {

    public static void main(String[] args) {
        // read50Gigs("50gigfile.txt", "cleaned50gigfile.txt");
        read50Gigs("./testSmallFile", "./cleaned");
    }

    public static void read50Gigs(String fileLocation, String newFileLocation) {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:./words");
             FileInputStream fileInputStream = new FileInputStream(fileLocation);
             Scanner scanner = new Scanner(fileInputStream);
             FileOutputStream fileOutputStream = new FileOutputStream(newFileLocation);
             OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream)) {

            connection.createStatement().execute("DROP TABLE IF EXISTS WORDS;");
            connection.createStatement().execute("CREATE TABLE WORDS(WORD VARCHAR NOT NULL);");

            PreparedStatement insertStatement =
                    connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
            PreparedStatement queryStatement =
                    connection.prepareStatement("SELECT * FROM WORDS WHERE UPPER(WORD) = UPPER(?);");

            while (scanner.hasNext()) {
                String nextWord = scanner.next();
                queryStatement.setString(1, nextWord);
                ResultSet resultSet = queryStatement.executeQuery();
                if (!resultSet.next()) { // word not found, ok
                    outputStreamWriter.write(scanner.hasNext() ? (nextWord + ' ') : nextWord);
                    insertStatement.setString(1, nextWord);
                    insertStatement.execute();
                }
                // word found, just don't write anything
            }
        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }
}

You need to add the H2 driver jar to the classpath.

Note that I only tested this with a small file of about 10 words. You should try this approach on your 50 GB file and report back any errors.

Keep in mind that this approach:

  • normalizes all whitespace and newlines to a single space, and
  • always keeps the first occurrence of a word and drops all subsequent occurrences.

The time this approach takes grows steeply with the number of words in the file, roughly quadratically, because each lookup scans the whole table; a variant with an index is sketched below.
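As a hedged sketch of how that scan could be avoided (my adaptation, not part of the original answer): make the uppercased word the primary key, so each lookup is an index search instead of a full table scan. The class name is mine; everything else mirrors the program above.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.*;
import java.util.Scanner;

public class H2WordReadingIndexed {
    public static void read50Gigs(String fileLocation, String newFileLocation) {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:./words");
             Scanner scanner = new Scanner(new FileInputStream(fileLocation));
             OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(newFileLocation))) {

            connection.createStatement().execute("DROP TABLE IF EXISTS WORDS;");
            // PRIMARY KEY creates an index, so each lookup below is a tree
            // search rather than a scan over every row inserted so far
            connection.createStatement().execute("CREATE TABLE WORDS(WORD VARCHAR PRIMARY KEY);");

            PreparedStatement insert = connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
            PreparedStatement query = connection.prepareStatement("SELECT 1 FROM WORDS WHERE WORD = ?;");

            while (scanner.hasNext()) {
                String nextWord = scanner.next();
                String key = nextWord.toUpperCase(); // keeps the case-insensitive behavior
                query.setString(1, key);
                try (ResultSet resultSet = query.executeQuery()) {
                    if (!resultSet.next()) { // word not seen yet
                        writer.write(scanner.hasNext() ? nextWord + ' ' : nextWord);
                        insert.setString(1, key);
                        insert.execute();
                    }
                }
            }
        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }
}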

+1
import java.util.Iterator;
import java.util.LinkedHashMap;

// Remove duplicate words from a string of space-separated words
public String removeDupsFromFile(String str) {
    String[] words = str.split(" ");
    // LinkedHashMap preserves insertion order, so the first occurrence of each word wins
    LinkedHashMap<String, Integer> map = new LinkedHashMap<>();
    for (int i = 0; i < words.length; i++) {
        if (map.containsKey(words[i])) {
            int count = map.get(words[i]) + 1;
            map.put(words[i], count);
        } else {
            map.put(words[i], 1);
        }
    }
    StringBuilder result = new StringBuilder();
    Iterator<String> itr = map.keySet().iterator();
    while (itr.hasNext()) {
        result.append(itr.next()).append(' ');
    }
    return result.toString();
}
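A minimal usage sketch (the Demo class, the hypothetical WordUtils wrapper, and the sample input are mine). Note that this method holds the entire input string in memory, so it suits small inputs rather than the 50 GB file from the question:

public class Demo {
    public static void main(String[] args) {
        // WordUtils is a hypothetical class containing removeDupsFromFile above
        String text = "word1 word2 word1 word3 word2";
        System.out.println(new WordUtils().removeDupsFromFile(text));
        // prints: word1 word2 word3 (with a trailing space)
    }
}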
0

Source: https://habr.com/ru/post/1244865/

