Does the Marpa parser library support error recovery?

I know that Marpa, the Earley parser for Perl, has very good error reporting.

But I cannot find in its documentation, or through Googling, whether it supports error recovery.

For example, most C/C++ compilers have error recovery, which they use to report several syntax errors in a single run, whereas other compilers often stop at the first error.

I am actually parsing natural language, and I wonder if there is a way to re-synchronize and resume parsing after one part of the input fails to parse.


An example, for those who are interested:

I am parsing syllables in Lao. In Lao, some vowels are diacritics, which are encoded as separate characters and displayed above the preceding consonant. While parsing random articles from the Lao Wikipedia, I came across some text where such a vowel was doubled. This is not allowed in Lao spelling, so it must be a typo. But I know that two characters later the text is fine again.

In any case, this is a real example that sparked my general interest in error recovery, or re-synchronizing with the token stream.

1 answer

There are two ways to handle errors in Marpa.

"Ruby Slippers" parsing

Marpa maintains a lot of context during the parse. We can ask it which tokens the parser expects at the current position, and then decide whether we want to offer such a token to Marpa even though it does not actually appear in the input. Consider, for example, a programming language that requires every statement to end with a semicolon. We can then use the Ruby Slippers technique to insert semicolons at certain places, for example at the end of a line or before a closing brace:

    use strict;
    use warnings;
    use Marpa::R2;
    use Data::Dump 'dd';

    my $grammar = Marpa::R2::Scanless::G->new({
        source => \q{
            :discard ~ ws

            Block     ::= Statement+ action => ::array
            Statement ::= StatementBody (STATEMENT_TERMINATOR) action => ::first
            StatementBody ::=
                  'statement'         action => ::first
                | ('{') Block ('}')   action => ::first

            STATEMENT_TERMINATOR ~ ';'
            event ruby_slippers = predicted STATEMENT_TERMINATOR

            ws ~ [\s]+
        },
    });

    my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

    my $input = q( statement; { statement } statement statement );

    for ( $recce->read(\$input); $recce->pos < length $input; $recce->resume ) {
        ruby_slippers($recce, \$input);
    }
    ruby_slippers($recce, \$input);

    dd $recce->value;

    sub ruby_slippers {
        my ($recce, $input) = @_;
        my %possible_tokens_by_length;

        # which lexemes does Marpa expect at the current position?
        my @expected = @{ $recce->terminals_expected };
        for my $token (@expected) {
            pos($$input) = $recce->pos;

            if ($token eq 'STATEMENT_TERMINATOR') {
                # fudge a terminator at the end of a line, or before a closing brace
                if ($$input =~ /\G \s*? (?: $ | [}] )/smxgc) {
                    push @{ $possible_tokens_by_length{0} },
                        [STATEMENT_TERMINATOR => ';'];
                }
            }
        }

        my $max_length = 0;
        for (keys %possible_tokens_by_length) {
            $max_length = $_ if $_ > $max_length;
        }

        # offer the fudged tokens to Marpa, then check whether more are needed
        if (my $longest_tokens = $possible_tokens_by_length{$max_length}) {
            for my $lexeme (@$longest_tokens) {
                $recce->lexeme_alternative(@$lexeme);
            }
            $recce->lexeme_complete($recce->pos, $max_length);
            return ruby_slippers($recce, $input);
        }
    }

Inside the ruby_slippers function you can also count how often you needed to fudge a token. If that count exceeds a certain threshold, you can abandon the parse by throwing an error.
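
A minimal sketch of such a safety valve, under the assumption that you add a counter $fudge_count and an arbitrary $MAX_FUDGES threshold (neither is part of Marpa's API, and the helper name is made up) and call the helper right before the lexeme_alternative/lexeme_complete calls in ruby_slippers above:

    # Hypothetical fudge-counting helper; names and threshold are illustrative only.
    my $fudge_count = 0;
    my $MAX_FUDGES  = 100;    # arbitrary cutoff; tune it to your input

    sub note_fudged_token {
        my ($recce) = @_;
        $fudge_count++;
        die sprintf("Giving up after fudging %d tokens (last at position %d)\n",
                    $fudge_count, $recce->pos)
            if $fudge_count > $MAX_FUDGES;
        return;
    }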

Skipping input

If your input may contain unparseable junk, you can try to skip over it when no token can be found otherwise. To do this, the $recce->resume method accepts an optional position argument at which normal parsing will resume.

    use strict;
    use warnings;
    use Marpa::R2;
    use Data::Dump 'dd';
    use Try::Tiny;

    my $grammar = Marpa::R2::Scanless::G->new({
        source => \q{
            :discard ~ ws
            Sentence ::= WORD+ action => ::array
            WORD ~ 'foo':i | 'bar':i | 'baz':i | 'qux':i
            ws ~ [\s]+
        },
    });

    my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

    my $input = '1) Foo bar: baz and qux, therefore qux (foo!) implies bar.';

    try { $recce->read(\$input) };
    while ($recce->pos < length $input) {
        # ruby_slippers($recce, \$input);
        try   { $recce->resume }                            # restart at the current position
        catch { try { $recce->resume($recce->pos + 1) } };  # advance the position
        # if both fail, we go into a new iteration of the loop.
    }

    dd $recce->value;

While much the same effect could be achieved with a :discard token that matches anything, doing the skipping in our own client code allows us to abort the parse if we have to skip too often.
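
As a sketch of that idea (reusing $recce and $input from the snippet above; the $skip_count variable and the $MAX_SKIPS threshold are illustrative assumptions, not Marpa features), the resume loop could be rewritten like this:

    use Try::Tiny;

    my $skip_count = 0;
    my $MAX_SKIPS  = 20;    # arbitrary limit on how much junk we tolerate

    try { $recce->read(\$input) };
    while ($recce->pos < length $input) {
        try   { $recce->resume }    # try to continue normally
        catch {
            $skip_count++;
            die "Aborting parse: skipped unparseable input $skip_count times\n"
                if $skip_count > $MAX_SKIPS;
            try { $recce->resume($recce->pos + 1) };    # skip one character
        };
    }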

