The answer by H2 is good, but perhaps overkill. The complete set of English words amounts to no more than a few MB, so just collect them into a Set. You can use this in RAnders00's program:
public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {
        while (scanner.hasNext()) {
            String nextWord = scanner.next();
            words.add(nextWord);
        }
        System.out.println("words size " + words.size());
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
As a sanity check on common words, I ran this against War and Peace (from Project Gutenberg):
public static void read50Gigs(String fileLocation, String newFileLocation) {
    try {
        // tested with fileLocation = "war and peace.txt"
        Set<String> words = Files.lines(Paths.get(fileLocation))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());
        System.out.println("words size " + words.size());
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
Completed in 0 seconds. Note that you can only use Files.lines if your huge source file actually contains line breaks. With line breaks, the file is processed one line at a time, so it will not use too much memory.
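To illustrate that laziness, here is a minimal, self-contained sketch: it writes a tiny temporary file (standing in for the huge input) and deduplicates its words via Files.lines. Only one line is held in memory at a time, so memory usage is bounded by the distinct-word set rather than the file size. The file name and contents are made up for the demo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyLinesDemo {
    public static void main(String[] args) throws IOException {
        // Tiny stand-in for the 50 GB input file
        Path tmp = Files.createTempFile("words", ".txt");
        Files.write(tmp, List.of("the quick brown fox",
                                 "the lazy dog",
                                 "quick quick"));

        // Files.lines streams lazily: lines are read one at a time,
        // and only the Set of distinct words accumulates in memory.
        Set<String> words;
        try (Stream<String> lines = Files.lines(tmp)) {
            words = lines
                    .flatMap(l -> Arrays.stream(l.split("\\s+")))
                    .collect(Collectors.toSet());
        }
        System.out.println(words.size()); // 6 distinct words
        Files.delete(tmp);
    }
}
```

The try-with-resources around the Stream matters: Files.lines holds the file open until the stream is closed.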
brian