Delete duplicate words in a large text file - Java

I have a text file larger than 50 GB and I want to remove the duplicate words from it. But I have heard that loading every word of the file into a HashSet takes a lot of RAM. Can you tell me a good way to remove every duplicate word from such a text file? The words are separated by spaces, like this:

word1 word2 word3 ... ... 
-3
3 answers

The H2 answer is good, but perhaps overkill. All the distinct words in English add up to no more than a few MB. Just use a Set. You could use this inside RAnders00's program:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {
        // Scanner streams the input token by token, so only the set of
        // distinct words is held in memory, not the whole file
        while (scanner.hasNext()) {
            String nextWord = scanner.next();
            words.add(nextWord);
        }
        System.out.println("words size " + words.size());
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

As an estimate of how many distinct words to expect, I tried this on War and Peace (from Project Gutenberg):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    // note: the method parameters are ignored in this test version
    try {
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());
        System.out.println("words size " + words.size()); // 22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    } catch (IOException e) {
        e.printStackTrace(); // don't swallow the exception silently
    }
}

It completed in 0 seconds. You cannot use Files.lines unless your huge source file actually contains line breaks. If it does, Files.lines processes the file lazily, line by line, so it will not use too much memory.
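If collecting the whole set before writing is a concern, here is a minimal sketch of the same idea that writes each word out the first time it is seen. Like Files.lines above, it assumes the input contains line breaks; the file names input.txt and out.txt are placeholders of mine, and the seen set still grows with the number of distinct words (but not with the file size):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class StreamDedup {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        try (Stream<String> lines = Files.lines(Paths.get("input.txt"));
             BufferedWriter writer = Files.newBufferedWriter(Paths.get("out.txt"))) {
            lines.flatMap(Pattern.compile("\\s+")::splitAsStream)
                 .filter(word -> !word.isEmpty() && seen.add(word)) // Set.add returns false for duplicates
                 .forEach(word -> {
                     try {
                         writer.write(word);
                         writer.write(' ');
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        } catch (IOException | UncheckedIOException e) {
            e.printStackTrace();
        }
    }
}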

+3

This approach uses a database to store the words found so far.

It also assumes that words are equal regardless of case.

The H2 documentation states that a database on a file system other than FAT has a maximum size of 4 TB (using the default page size of 2 KB), which is more than enough for this purpose.

package com.stackoverflow;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.*;
import java.util.Scanner;

public class H2WordReading {

    public static void main(String[] args) {
        // read50Gigs("50gigfile.txt", "cleaned50gigfile.txt");
        read50Gigs("./testSmallFile", "./cleaned");
    }

    public static void read50Gigs(String fileLocation, String newFileLocation) {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:./words");
             FileInputStream fileInputStream = new FileInputStream(fileLocation);
             Scanner scanner = new Scanner(fileInputStream);
             FileOutputStream fileOutputStream = new FileOutputStream(newFileLocation);
             OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream)) {

            connection.createStatement().execute("DROP TABLE IF EXISTS WORDS;");
            connection.createStatement().execute("CREATE TABLE WORDS(WORD VARCHAR NOT NULL);");

            PreparedStatement insertStatement =
                    connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
            PreparedStatement queryStatement =
                    connection.prepareStatement("SELECT * FROM WORDS WHERE UPPER(WORD) = UPPER(?);");

            while (scanner.hasNext()) {
                String nextWord = scanner.next();
                queryStatement.setString(1, nextWord);
                ResultSet resultSet = queryStatement.executeQuery();
                if (!resultSet.next()) { // word not found, ok
                    outputStreamWriter.write(scanner.hasNext() ? (nextWord + ' ') : nextWord);
                    insertStatement.setString(1, nextWord);
                    insertStatement.execute();
                }
                // word found, just don't write anything
            }
        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }
}

You need to add the H2 driver jar to the classpath.

Note that I only tested this with a small file of about 10 words. You should try this approach on your 50 GB file and report back any errors.

Keep in mind that this approach:

  • normalizes all whitespace and newlines to a single space, and
  • always keeps the first occurrence of a word and drops all subsequent occurrences.

The time this approach takes grows steeply with the number of words in the file, roughly quadratically, because each lookup scans the whole table; a variant with an index is sketched below.
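As a hedged sketch of how that scan could be avoided (my adaptation, not part of the original answer): make the uppercased word the primary key, so each lookup is an index search instead of a full table scan. The class name is mine; everything else mirrors the program above.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.*;
import java.util.Scanner;

public class H2WordReadingIndexed {
    public static void read50Gigs(String fileLocation, String newFileLocation) {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:./words");
             Scanner scanner = new Scanner(new FileInputStream(fileLocation));
             OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(newFileLocation))) {

            connection.createStatement().execute("DROP TABLE IF EXISTS WORDS;");
            // PRIMARY KEY creates an index, so each lookup below is a tree
            // search rather than a scan over every row inserted so far
            connection.createStatement().execute("CREATE TABLE WORDS(WORD VARCHAR PRIMARY KEY);");

            PreparedStatement insert = connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
            PreparedStatement query = connection.prepareStatement("SELECT 1 FROM WORDS WHERE WORD = ?;");

            while (scanner.hasNext()) {
                String nextWord = scanner.next();
                String key = nextWord.toUpperCase(); // keeps the case-insensitive behavior
                query.setString(1, key);
                try (ResultSet resultSet = query.executeQuery()) {
                    if (!resultSet.next()) { // word not seen yet
                        writer.write(scanner.hasNext() ? nextWord + ' ' : nextWord);
                        insert.setString(1, key);
                        insert.execute();
                    }
                }
            }
        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }
}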

+1
import java.util.Iterator;
import java.util.LinkedHashMap;

// Remove duplicate words from a string of space-separated words
public String removeDupsFromFile(String str) {
    String[] words = str.split(" ");
    // LinkedHashMap preserves insertion order, so the first occurrence of each word wins
    LinkedHashMap<String, Integer> map = new LinkedHashMap<>();
    for (int i = 0; i < words.length; i++) {
        if (map.containsKey(words[i])) {
            int count = map.get(words[i]) + 1;
            map.put(words[i], count);
        } else {
            map.put(words[i], 1);
        }
    }
    StringBuilder result = new StringBuilder();
    Iterator<String> itr = map.keySet().iterator();
    while (itr.hasNext()) {
        result.append(itr.next()).append(' ');
    }
    return result.toString();
}
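A minimal usage sketch (the Demo class, the hypothetical WordUtils wrapper, and the sample input are mine). Note that this method holds the entire input string in memory, so it suits small inputs rather than the 50 GB file from the question:

public class Demo {
    public static void main(String[] args) {
        // WordUtils is a hypothetical class containing removeDupsFromFile above
        String text = "word1 word2 word1 word3 word2";
        System.out.println(new WordUtils().removeDupsFromFile(text));
        // prints: word1 word2 word3 (with a trailing space)
    }
}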
0

Source: https://habr.com/ru/post/1244865/

