Stack overflow in Prolog DCG grammar rule: how to process large lists efficiently or lazily

I am parsing a fairly simple file format, consisting of a series of lines, each line has some space-separated fields that look like this:

l 0x9823 1 s 0x1111 3 l 0x1111 12 

I am using SWI-Prolog. This is the DCG that I still have:

 :- consult(library(pure_input)). load_trace(Filename, Traces) :- phrase_from_file(trace_file_phrase(Traces), Filename). trace_file_phrase([]) --> []. trace_file_phrase([T|Ts]) --> trace_phrase(T), trace_file_phrase(Ts). trace_phrase(access(Type, Address, SinceLast)) --> access_type(Type), space, address(Address), space, nat(SinceLast), newline. access_type(load) --> "l". access_type(store) --> "s". address(Number) --> "0x", hexnum(Number). hexdigit(N) --> digit(N). hexdigit(10) --> "a". hexdigit(11) --> "b". hexdigit(12) --> "c". hexdigit(13) --> "d". hexdigit(14) --> "e". hexdigit(15) --> "f". hexnum(N) --> hexdigit(D), hexnum(D, N). hexnum(N, N) --> []. hexnum(A, N) --> hexdigit(D), { A1 is A*16 + D }, hexnum(A1, N). newline --> "\n". space --> " ". %% the following two productions are courtesy of Lars Mans at %% /questions/126176/parsing-numbers-with-multiple-digits-in-prolog digit(0) --> "0". digit(1) --> "1". digit(2) --> "2". digit(3) --> "3". digit(4) --> "4". digit(5) --> "5". digit(6) --> "6". digit(7) --> "7". digit(8) --> "8". digit(9) --> "9". nat(N) --> digit(D), nat(D,N). nat(N,N) --> []. nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1, N). 

As mentioned in the comment, I cut the processing of numbers from Parsing numbers with multiple digits in Prolog .

The problem I am facing is that some of these files are large, for example about 5-10 MB. SWI-Prolog is not enough for this stack by default, and parsing these files takes a considerable time, of the order of 5-15 seconds. I have a few questions about this situation:

  • Where is the performance issue in this code? I think this is either in trace_file_phrase//1 , or nat//1 , but these are just guesses.
  • If the problem is a list, is there a better way to handle lists with DCG than this?
  • How, in general, to diagnose and handle performance problems with DCGs like this?
+16
prolog dcg swi-prolog
Oct 17 '12 at 17:19
source share
3 answers

As a general comment, you will find more information about it under the name library(pio) . It is also useful to use it cleanly:

 :- use_module(library(pio)). 

Your example is too complicated, so I will only consider a slightly simpler case - a list of numbers separated by a newline:

 nats ([]) -> [].
 nats ([N | Ns]) -> nat (N), newline, nats (Ns).

So how can you test this effectively? (What is your question 3) The main point of library(pio) is that you can use regular DCGs to process files. But for testing in small, you can still use simple phrase/2 . Therefore I:

 ? - phrase (nats (Ns), "1 \ n").
 Ns = [1];
 false

You see a hint ; ? This means: Prolog was unable to decide whether further answers could be calculated, so it leaves one or more selection points open. And this is only for one digit . You can imagine how things accumulate.

Let dig deeper:

 ? - phrase (digit (D), "1").
 D = 1;
 false

Again evil ; ! To make this work, everything had to be determined. There are three ways:

Use cuts (and lose your soul)

I wish you good luck - best after repeated element:

 trace_file_phrase ([]) -> [].
 trace_file_phrase ([T | Ts]) ->
    trace_phrase (T),
    !,% ugly, but ...
    trace_file_phrase (Ts).

(This should answer question 1)

But wait a minute! What's bad about it? ! ? As long as there is exactly one answer to the question trace_phrase//1 , everything is perfect. It is only if there are more answers (or actually solutions) that the cut can remove precious answers. How do you know if there are more solutions? Oh no. And you will not see them, as they have already been cut off.

call_semidet/1

Here's what you can do to prevent this from happening. This only works for free purposes without side effects that can be caused twice without any effects:

 call_semidet (Goal): -
    (call_nth (Goal, 2)
    -> throw (error (mode_error (semidet, Goal), _))
    ;  once (Goal)
    )

call_nth/2 used call_nth/2 , as defined in another post. (As an optimization, an implementation could avoid calling Goal twice when there is no pick-point open ...) Just to understand how this works:

 ? - phrase (nat (N), "1234").
 N = 1234;
 false

 ? - call_semidet (phrase (nat (N), "1234")).
 N = 1234.

 ? - call_semidet ((X = 1; X = 2)).
 ERROR: Unknown error term: mode_error (semidet, (2 = 1; 2 = 2))

So, this makes your little grammar an effective determination! Therefore, there is no need to reformulate anything!

There is currently no integration of this into grammar. You can do this very low level or rather pure use of library(lambda) .

 phrase_semidet (NT) ->
    call (S0 ^ S ^ call_semidet (phrase (NT, S0, S))).

Please note that in this particular case we do not use \ to rename.

 trace_file_phrase ([]) -> [].
 trace_file_phrase ([T | Ts]) ->
    phrase_semidet (trace_phrase (T)),
    trace_file_phrase (Ts).

Indexing index

Finally, a very laborious but clean way would be to rewrite everything in order to better profit from indexing (and perhaps help improve indexing in general ...) But this is a long way to go. To start:

 digit (D) -> [C],
    {c_digit (C, D)}.

 c_digit (0'0.0).
 c_digit (0'1,1).
 c_digit (0'2,2).
 c_digit (0'3,3).
 c_digit (0'4,4).
 c_digit (0'5.5).
 c_digit (0'6.6).
 c_digit (0'7.7).
 c_digit (0'8.8).
 c_digit (0'9.9).

This gives you now:

 ? - phrase (digit (D), "1").
 D = 1.

But you have another source of non-determinism, which is more likely due to how you define grammar. In nat//2 you see:

 nat (N, N) -> [].
 nat (A, N) -> digit (D), ....

The first rule always applies, i.e. "1234\n" will be parsed as "1" "12" "123" "1234" only the next newline//0 understands that it will be enough for the last - and then stick to it.

You can rewrite something for this, but then the code is no longer a pure small spectrum that you liked, can you? Perhaps this may change in the future.

eg. indexing in SWI is much better than before, maybe things are developing here too ....

The goal of library(pio) was to start this process. Compare this to Haskell - we are far from interact efficiency! But there is no inherent value:

 ... -> [] |  [_], ....

 ? - phrase_from_file ((..., "searchstring", ...), fichier).

as effective as grep, space-wise. That is, he works in a constant space. Therefore, I hope the code will work more in the future.

Edit: BTW, library(pio) already has impact efficiency: GC phases have been greatly improved, very much like Wadler. Elimination of some space leaks; ndash paper a quarter century ago. Things are evolving ...

+14
Oct 17 '12 at 20:10
source share

I checked stackoverflow in a 2Mb file. Then I rewrote the grammar using the library (dcg / basics) and now it works.

 :- [library(dcg/basics)]. load_trace_0(Filename, Ls) :- phrase_from_file(lines(Ls), Filename). lines([s(H,I)|R]) --> "s 0x", xinteger(H), " ", integer(I), blanks, !, lines(R). lines([l(H,I)|R]) --> "l 0x", xinteger(H), " ", integer(I), blanks, !, lines(R). lines([]) --> []. 

But then I tried to put the abbreviation on your grammar and it works too. So the answer from @gusbro (+1) solves your problem.

+6
Oct 17 '12 at 20:05
source share

About the issue of efficiency:

If your input is usually well-formed, I think you should swap nat/4 and hexnum/4 so that they read:

 nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1, N). nat(N,N) --> []. hexnum(A, N) --> hexdigit(D), { A1 is A*16 + D }, hexnum(A1, N). hexnum(N, N) --> []. 

because you want to stop parsing a number when there are no more digits.

If used wisely, a cut ( ! ) Can help you in terms of performance, as well as in relation, since it cuts the prolog evaluation tree. For example, you can commit ( ! ) At the end of trace_file_phrase/3 (that is, after newline ), because you do not need to repeat this part of the input again to find other solutions.

+4
Oct 17 '12 at 18:22
source share



All Articles