Here is a partial solution: instead of using regexps, a manual tokenizer performs much better:
[{X, 1} || X <- words(Bin)].

words(Bin) ->
    words_2(Bin, [], []).

%% Word characters: [A-Za-z0-9_]. Accumulate into CAcc.
words_2(<<C, Rest/binary>>, CAcc, WAcc) when
        (C >= $A) and (C =< $Z);
        (C >= $a) and (C =< $z);
        (C >= $0) and (C =< $9);
        C =:= $_ ->
    words_2(Rest, [C | CAcc], WAcc);
%% Separator with no pending word: just skip it.
words_2(<<_, Rest/binary>>, [], WAcc) ->
    words_2(Rest, [], WAcc);
%% End of input, no pending word: return words in order.
words_2(<<>>, [], WAcc) ->
    lists:reverse(WAcc);
%% Separator or end of input with a pending word: flush it.
words_2(Rest, CAcc, WAcc) ->
    words_2(Rest, [], [list_to_binary(lists:reverse(CAcc)) | WAcc]).
This brings the memory usage down from the 1.2 GB seen with regexps to an acceptable level. Unfortunately, the 800 MB consumed by lists:keysort(...) seems to be the price of doing it this way in Erlang.
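If the {X, 1} tuples are only built in order to count word occurrences, one alternative (my suggestion, not part of the original solution) is to fold the token list straight into a map, which avoids materialising the tuple list and the keysort pass entirely:

```erlang
%% Count word occurrences without building [{Word, 1}] tuples.
%% words/1 is the tokenizer defined above.
count_words(Bin) ->
    lists:foldl(
      fun(W, Acc) -> maps:update_with(W, fun(N) -> N + 1 end, 1, Acc) end,
      #{},
      words(Bin)).
```

maps:update_with/4 inserts the initial value 1 on first sight of a word and increments thereafter, so the result is a map of word to count.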
As for regexps in general: Erlang implements them through the "re" module, which is a binding to PCRE.