You recompile every regular expression on every line and on every word. Instead of .flatMap(line -> Arrays.stream(line.split("\\s+"))) write .flatMap(Pattern.compile("\\s+")::splitAsStream). Likewise, instead of .filter(word -> word.matches("\\w+")) use .filter(Pattern.compile("^\\w+$").asPredicate()); the anchors are needed because asPredicate() tests with find() rather than matches(). The same goes for the regular expression in map.
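To illustrate why the anchors matter, here is a minimal sketch comparing the anchored and unanchored predicates. The class name PredicateDemo, the helper keep and the sample words are made up for the example and are not from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; names and sample data are illustrative only.
class PredicateDemo {
    // Compiled once, reused for every word. asPredicate() uses find(),
    // so the anchors force the whole string to consist of word characters.
    static final Predicate<String> WHOLE_WORD = Pattern.compile("^\\w+$").asPredicate();

    // Without anchors, find() accepts any string that merely contains a word character.
    static final Predicate<String> CONTAINS_WORD = Pattern.compile("\\w+").asPredicate();

    static List<String> keep(List<String> words, Predicate<String> predicate) {
        return words.stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("war", "and", "--", "it's", "peace");
        System.out.println(keep(words, WHOLE_WORD));    // [war, and, peace]
        System.out.println(keep(words, CONTAINS_WORD)); // [war, and, it's, peace]
    }
}
```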
It might be better to swap .map(s -> s.toLowerCase()) and .filter(s -> s.length() >= 2) so that toLowerCase() is not called for single-letter words.
There is no need to use Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum). Firstly, your stream is not parallel, so you can simply replace toConcurrentMap with toMap. Secondly, it would probably be more efficient (although this needs to be measured) to use Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)), as this reduces boxing (but adds a finisher step that boxes all the values at once).
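As a rough sketch of the two collectors side by side (the class name CountDemo and the sample words are invented for the example), both produce the same counts, so the choice comes down to performance, which would still need to be benchmarked:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; names and sample data are illustrative only.
class CountDemo {
    // toMap with a merge function: each occurrence contributes a boxed Integer 1,
    // and Integer::sum unboxes and re-boxes on every merge.
    static Map<String, Integer> withToMap(List<String> words) {
        return words.stream()
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    // groupingBy + summingInt: sums into a primitive accumulator per key
    // and boxes each total once in the finisher.
    static Map<String, Integer> withGroupingBy(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("war", "and", "peace", "war");
        System.out.println(withToMap(words).equals(withGroupingBy(words))); // true
    }
}
```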
Instead of (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()) you can use a ready-made comparator, Map.Entry.comparingByValue(Comparator.reverseOrder()), where the reverseOrder() argument preserves the descending order (although this is probably a matter of taste).
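A small sketch of the ready-made comparator in use, assuming a map of counts like the one built above. The class name TopDemo, the helper top and the sample counts are made up for the example.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; names and sample data are illustrative only.
class TopDemo {
    // Returns the keys of the n entries with the largest values, largest first.
    static List<String> top(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("peace", 1);
        counts.put("the", 3);
        counts.put("war", 2);
        System.out.println(top(counts, 2)); // [the, war]
    }
}
```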
Summarizing:
    Map<String, Integer> wc = Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
            .map(Pattern.compile("\\p{Punct}")::matcher)
            .map(matcher -> matcher.replaceAll(""))
            .flatMap(Pattern.compile("\\s+")::splitAsStream)
            .filter(Pattern.compile("^\\w+$").asPredicate())
            .filter(s -> s.length() >= 2)
            .map(s -> s.toLowerCase())
            .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));

    wc.entrySet()
            .stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .limit(5)
            .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
If you don't like method references (some people don't), you can store precompiled regular expressions in variables instead.
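For instance, the variant with variables might look roughly like this. It is only a sketch: the class name WordCount, the constant names and the in-memory sample lines are my own, and it reads from a Stream<String> rather than a file so that it runs on its own.

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class; names and sample data are illustrative only.
class WordCount {
    // Each pattern is compiled exactly once.
    static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    static final Pattern SPACES = Pattern.compile("\\s+");
    static final Pattern WORD = Pattern.compile("^\\w+$");

    static Map<String, Integer> count(Stream<String> lines) {
        return lines
                .map(line -> PUNCT.matcher(line).replaceAll(""))
                .flatMap(line -> SPACES.splitAsStream(line))
                // find() with an anchored pattern mirrors what asPredicate() does
                .filter(word -> WORD.matcher(word).find())
                .filter(word -> word.length() >= 2)
                .map(word -> word.toLowerCase())
                .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        Map<String, Integer> wc = count(Stream.of("War, and peace!", "A war."));
        System.out.println(wc.get("war")); // 2
    }
}
```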