Here is a partial solution: instead of using regexps, a manual tokenizer performs much better:
[{X, 1} || X <- words(Bin)].

words(Bin) ->
    words_2(Bin, [], []).

%% Word characters: [A-Za-z0-9_]. Accumulate into CAcc.
words_2(<<C, Rest/binary>>, CAcc, WAcc) when
        (C >= $A) and (C =< $Z);
        (C >= $a) and (C =< $z);
        (C >= $0) and (C =< $9);
        C =:= $_ ->
    words_2(Rest, [C | CAcc], WAcc);
%% Separator with no pending word: just skip it.
words_2(<<_, Rest/binary>>, [], WAcc) ->
    words_2(Rest, [], WAcc);
%% End of input, no pending word: return words in order.
words_2(<<>>, [], WAcc) ->
    lists:reverse(WAcc);
%% Separator or end of input with a pending word: flush it.
words_2(Rest, CAcc, WAcc) ->
    words_2(Rest, [], [list_to_binary(lists:reverse(CAcc)) | WAcc]).
This brings the memory usage down from the 1.2 GB seen with regexps to an acceptable level. Unfortunately, the 800 MB consumed by lists:keysort(...) seems to be the price of doing it this way in Erlang.
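If the {X, 1} tuples are only built in order to count word occurrences, one alternative (my suggestion, not part of the original solution) is to fold the token list straight into a map, which avoids materialising the tuple list and the keysort pass entirely:

```erlang
%% Count word occurrences without building [{Word, 1}] tuples.
%% words/1 is the tokenizer defined above.
count_words(Bin) ->
    lists:foldl(
      fun(W, Acc) -> maps:update_with(W, fun(N) -> N + 1 end, 1, Acc) end,
      #{},
      words(Bin)).
```

maps:update_with/4 inserts the initial value 1 on first sight of a word and increments thereafter, so the result is a map of word to count.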
As for regexps in general: Erlang implements them through the "re" module, which is a binding to PCRE.