Improving a Java 8 way to find the most common words in "War and Peace"

I came across this problem in Richard Bird's book: find the top five most common words in War and Peace (or any other text, for that matter).

Here is my current attempt:

package tmp;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WarAndPeace {
    public static void main(String[] args) throws Exception {
        // Count the occurrences of every word of at least two characters.
        Map<String, Integer> wc = Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
                .map(line -> line.replaceAll("\\p{Punct}", ""))
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> word.matches("\\w+"))
                .map(s -> s.toLowerCase())
                .filter(s -> s.length() >= 2)
                .collect(Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum));

        // Print the five most frequent words.
        wc.entrySet()
          .stream()
          .sorted((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()))
          .limit(5)
          .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}

It looks reasonably concise and runs fast enough. This is what it prints on my laptop:

$> time java -server -Xmx10g -cp target/classes tmp.WarAndPeace
the: 34566
and: 22152
to: 16716
of: 14987
a: 10521
java -server -Xmx10g -cp target/classes tmp.WarAndPeace  1.86s user 0.13s system 274% cpu 0.724 total

It usually finishes in under 2 seconds. Can you suggest further improvements, in terms of both expressiveness and performance?

PS: If you are interested in the rich history of this problem, see here.

2 answers

You recompile every regular expression on every line and on every word. Instead of .flatMap(line -> Arrays.stream(line.split("\\s+"))), write .flatMap(Pattern.compile("\\s+")::splitAsStream). The same goes for .filter(word -> word.matches("\\w+")): use .filter(Pattern.compile("^\\w+$").asPredicate()), and likewise for the map step.
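
To make the difference concrete, here is a minimal sketch of the two split variants side by side (the class and field names are mine, not from the question):

import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class RegexReuseSketch {

    // Original style: line.split("\\s+") compiles the regex again on every call.
    static Stream<String> splitRecompiling(String line) {
        return Arrays.stream(line.split("\\s+"));
    }

    // Suggested style: compile the pattern once and reuse it for every line.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static Stream<String> splitPrecompiled(String line) {
        return WHITESPACE.splitAsStream(line);
    }
}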

It might be better to swap .map(s -> s.toLowerCase()) and .filter(s -> s.length() >= 2) so that toLowerCase() is not called for single-letter words.

You do not need Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum). First, your stream is not parallel, so you can simply replace toConcurrentMap with toMap. Second, it would probably be more efficient (though this needs benchmarking) to use Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)), as this reduces boxing (at the cost of a finisher step that boxes all the values once at the end).
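
A minimal sketch of the two collector variants over a toy word stream (the data and class name are purely illustrative):

import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectorSketch {
    public static void main(String[] args) {
        // Sequential stream, so a plain toMap is enough; Integer::sum merges counts for duplicate keys.
        Map<String, Integer> viaToMap = Stream.of("war", "and", "peace", "and")
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));

        // groupingBy + summingInt accumulates into an int and boxes each value only once, in the finisher.
        Map<String, Integer> viaGrouping = Stream.of("war", "and", "peace", "and")
                .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));

        System.out.println(viaToMap);    // e.g. {peace=1, war=1, and=2}
        System.out.println(viaGrouping); // same contents
    }
}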

Instead of (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()) you can use a ready-made comparator: Map.Entry.comparingByValue() (although this is probably a matter of taste).
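
For instance, the hand-written comparator and the ready-made one produce the same descending-by-value order (a tiny sketch; the names are mine):

import java.util.Comparator;
import java.util.Map;

class ComparatorSketch {
    // Hand-written: compare the values in descending order.
    static final Comparator<Map.Entry<String, Integer>> HAND_WRITTEN =
            (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue());

    // Ready-made: comparingByValue over a reversed natural order
    // (Map.Entry.<String, Integer>comparingByValue().reversed() is equivalent).
    static final Comparator<Map.Entry<String, Integer>> READY_MADE =
            Map.Entry.comparingByValue(Comparator.reverseOrder());
}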

Summarizing:

Map<String, Integer> wc = Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
        .map(Pattern.compile("\\p{Punct}")::matcher)
        .map(matcher -> matcher.replaceAll(""))
        .flatMap(Pattern.compile("\\s+")::splitAsStream)
        .filter(Pattern.compile("^\\w+$").asPredicate())
        .filter(s -> s.length() >= 2)
        .map(s -> s.toLowerCase())
        .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));

wc.entrySet()
  .stream()
  .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
  .limit(5)
  .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));

If you don't like method references (some people don't), you can store precompiled regular expressions in variables instead.
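
For example, a minimal sketch of that variant, with the compiled patterns held in constants and used from plain lambdas (the names are mine):

import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class PrecompiledPatterns {
    private static final Pattern PUNCTUATION = Pattern.compile("\\p{Punct}");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Predicate<String> IS_WORD = Pattern.compile("^\\w+$").asPredicate();

    // Turns a stream of lines into a stream of lower-case words of length >= 2.
    static Stream<String> words(Stream<String> lines) {
        return lines
                .map(line -> PUNCTUATION.matcher(line).replaceAll(""))
                .flatMap(line -> WHITESPACE.splitAsStream(line))
                .filter(IS_WORD)
                .filter(s -> s.length() >= 2)
                .map(s -> s.toLowerCase());
    }
}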


You perform several redundant operations.

  • First, you replace all punctuation characters with empty strings, creating new strings, and then split on whitespace. This even carries the risk of merging words that are separated by punctuation without any whitespace. You could fix that by replacing punctuation with spaces instead, but in the end you do not need the replacement at all, because you can change the split pattern to "punctuation or whitespace".
  • Then you filter the split results, accepting only strings that consist entirely of word characters. Since you have already removed all punctuation and whitespace, this filters out strings containing characters that are neither word characters, whitespace, nor punctuation, and I am not sure whether that is the intended logic. After all, if you are only interested in words, why not look for words in the first place? Since Java 8 does not support streaming over matches, we can instead tell it to split on non-word characters as the delimiter.

  • Then you do .map(s -> s.toLowerCase()).filter(s -> s.length() >= 2) . Since converting English text to lower case does not change its length, the filter's outcome does not depend on the order, so we can filter first and skip the toLowerCase() conversion for strings the predicate rejects: .filter(s -> s.length() >= 2).map(s -> s.toLowerCase()) . The net benefit may be small, but it does not hurt.

  • Choosing the right Collector : Tagir has already explained this. Basically, there is Collectors.counting() , which would be preferable to Collectors.summingInt(w -> 1) , but unfortunately the current Oracle implementation is poor, because it is based on reduce and unboxes and re-boxes Long values for every element. (A sketch of that alternative appears at the end of this answer.)

Combining all of this, you get:

Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
     .flatMap(Pattern.compile("\\W+")::splitAsStream)
     .filter(s -> s.length() >= 2)
     .map(String::toLowerCase)
     .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)))
     .entrySet()
     .stream()
     .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
     .limit(5)
     .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));

As explained above, do not be surprised if the word counts come out slightly higher than with your approach.
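
For comparison, here is a sketch of the Collectors.counting() alternative mentioned in the last bullet (the class name is mine): it is the same pipeline, but the counts come out as Long, and with the current implementation it boxes on every element, so it is not necessarily faster.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CountingVariant {
    public static void main(String[] args) throws Exception {
        // Same splitting and filtering, but Collectors.counting() yields Long values.
        Map<String, Long> wc = Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
                .flatMap(Pattern.compile("\\W+")::splitAsStream)
                .filter(s -> s.length() >= 2)
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        wc.entrySet()
          .stream()
          .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
          .limit(5)
          .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}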


Source: https://habr.com/ru/post/1244863/

