Parser Errors - a template for automatic error handling

Question

Is there any known way to implement good error handling for machine-generated parsers? Is there a "pattern" or known algorithm for this kind of problem?

By "good" I mean something resembling the results obtained with hand-written recursive descent parsers and modern compilers: the parser does not stop at the first error, and it can be made to emit "meaningful" errors, not just "unrecognized token at line xyz", one error at a time.

Ideally, this approach should be automated, not manual.

I'm not looking for a library; I need an approach that can be used on different platforms and that is ideally as language-independent as possible.

+5

compiler-construction parsing grammar

Michele giuseppe fadda Nov 14 '15 at 13:17

4 answers

People have been trying to figure out how to report and repair syntax errors since the first parsers. There are many technical papers on how to do this. Searching for the string "syntax correction" on scholar.google.com gives 57 hits.

There are several problems:

1) How to report a meaningful error to the reader. To begin with, there is where the parser detects the error, and where the user actually made the error. For example, a C program might have a "++" operator in a strange place:

 void p { x = y ++ z = 0; <EOF>

Most parsers will choke when the "z" is encountered, and report that as the location of the error. However, if the mistake was using "++" where "+" was intended, this report is wrong. Unfortunately, getting this right requires reading the programmer's mind.

You also have the problem of the error message itself. Do you report the error as being in an expression [at first glance it seems so]? In a statement? In a line? In a function body? In a function declaration? You probably want to report the narrowest syntax category that can surround the point of error. (Note that you cannot report the function body or the function declaration as the "surroundings" of the error point, because they are not complete either!) What if the error was really a missing semicolon after "++"? Then the error location was not really "in an expression". What if the repair requires inserting a missing string quote? A macro continuation character?

So you need to decide somehow what constitutes the actual error, and that leads to error repair.

2) Error repair: for the tool to proceed in any meaningful way, it has to repair the error. Presumably this means patching the input token stream to produce a legal program (which you may not be able to do if the source has several errors). What if there are several possible patches? It should be obvious that the best error report is "yyyy is wrong, I suspect you should have used xxxx". How large a patch should be considered for the repair: only the token that triggered the error, the tokens that follow it, or also the tokens that precede it?

I note that it is hard to make automatic, general error repair proposals for hand-written parsers, because the grammar needed to guide such repairs is not explicitly available anywhere. So you would expect automatic repair to work best with tools for which the grammar is an explicit artifact.

It may also be that error repair should take common mistakes into account. If people tend to leave the ';' off statements and inserting one fixes the file, that can be a good repair. If they rarely do this, and there is more than one possible repair (for example, replacing "++" with "+"), an alternative repair is probably the better suggestion.

3) The semantic effect of the repair. Even if you fix the syntax errors, the repaired program may not make sense. If your error requires inserting an identifier, which identifier should be used?

FWIW, our DMS Software Reengineering Toolkit performs automatic grammar-driven repair. It works on the assumption that the token at the point of error should either be deleted, or that some other single token should be inserted to its left. This catches the missing ';' and the extra plus sign; it often succeeds in producing a legal repair. Often it is not the "right" one. At the very least, it lets the parser proceed to the rest of the source code.
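The single-token strategy just described (delete the offending token, or insert one token to its left) can be sketched in a few lines. This is only an illustration under stated assumptions, not how DMS is implemented: the toy grammar, the `VOCABULARY` of sample tokens, and the names `parse`, `first_error` and `single_token_repairs` are all hypothetical.

```python
VOCABULARY = ["x", "0", "=", "+", "++", ";"]  # assumed: one sample token per kind


class ParseError(Exception):
    def __init__(self, pos):
        super().__init__(f"syntax error at token {pos}")
        self.pos = pos


def parse(tokens):
    """Recognize stmt*, with stmt := ID '=' operand ('+' operand)* ';'."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "<EOF>"

    def expect(accepts):
        nonlocal pos
        if not accepts(peek()):
            raise ParseError(pos)
        pos += 1

    def operand():  # operand := ID ['++'] | NUM
        nonlocal pos
        expect(lambda t: t.isidentifier() or t.isdigit())
        if tokens[pos - 1].isidentifier() and peek() == "++":
            pos += 1

    while pos < len(tokens):
        expect(str.isidentifier)
        expect(lambda t: t == "=")
        operand()
        while peek() == "+":
            pos += 1
            operand()
        expect(lambda t: t == ";")


def first_error(tokens):
    """Index of the first token the parser rejects, or None if the input is legal."""
    try:
        parse(tokens)
    except ParseError as err:
        return err.pos
    return None


def single_token_repairs(tokens):
    """Try deleting the error token, then inserting each vocabulary token to its left."""
    at = first_error(tokens)
    if at is None:
        return []
    repairs = []
    if at < len(tokens) and first_error(tokens[:at] + tokens[at + 1:]) is None:
        repairs.append(("delete", tokens[at]))
    for tok in VOCABULARY:
        if first_error(tokens[:at] + [tok] + tokens[at:]) is None:
            repairs.append(("insert", tok))
    return repairs


print(single_token_repairs("x = y ++ z = 0 ;".split()))
```

On the `x = y ++ z = 0;` example, the only repair this sketch finds is inserting `;` before `z`. It never considers replacing `++` with `+`, because that token lies to the left of the point where the error was detected. When an operand is missing, it proposes inserting either sample operand, which is the "which identifier?" problem from point 3.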

I think the hunt for good automatic error repair will go on for a long time.

FWIW, the paper "Syntax Error Recovery for a Java-based Parser Generator" reports that Burke's Ph.D. thesis:

M. G. Burke, 1983, A Practical Method for LR and LL Syntactic Error Diagnosis and Recovery, Ph.D. thesis, Department of Computer Science, New York University.

is pretty good. In particular, it repairs errors by examining and revising the left context of the error, as well as the error region. It looks like you can get it from ACM.

+3

Ira Baxter Nov 14 '15 at 18:47

I have a completely different perspective on this issue, which is that you should not treat syntax errors as errors internal to the compiler. Any practical compiler actually implements three languages:

The language L, the nominal target language. Correct programs are members of this language.
The language M, which consists of L plus all the erroneous programs that the compiler recognizes. Members of M \ L get informative errors.
The language Z, the set of inputs on which the compiler terminates normally. This set should be the set of all possible input strings, but if the compiler crashes on some input, it is not. Members of Z \ M get generic messages about how the compiler failed, usually of the form "parser could not proceed at line x, char y".

You can use automatic parser generator tools of the kind you are looking for if you specify language M in your parser instead of language L. The problem with this approach is that language designers always specify L, not M. I cannot think of a single case where there is anything like a standard for M.
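To make the L / M / Z distinction concrete, here is a minimal sketch in which the "grammar" contains explicit error productions next to the legal one. The one-statement language `ID = NUM ;`, the two chosen error productions and the function name `check_statement` are assumptions made for illustration only.

```python
def check_statement(tokens):
    """Classify one whitespace-tokenized statement as belonging to L, M \\ L or Z \\ M."""
    is_id = str.isidentifier
    is_num = str.isdigit

    # Production of L:  ID '=' NUM ';'
    if (len(tokens) == 4 and is_id(tokens[0]) and tokens[1] == "="
            and is_num(tokens[2]) and tokens[3] == ";"):
        return ("L", "ok")

    # Error production (M \ L):  ID '=' NUM   with the ';' left off
    if (len(tokens) == 3 and is_id(tokens[0]) and tokens[1] == "="
            and is_num(tokens[2])):
        return ("M", f"missing ';' after '{tokens[2]}'")

    # Error production (M \ L):  ID '==' NUM ';'   comparison typed for assignment
    if (len(tokens) == 4 and is_id(tokens[0]) and tokens[1] == "=="
            and is_num(tokens[2]) and tokens[3] == ";"):
        return ("M", "'==' compares; use '=' to assign")

    # Everything else is only in Z: a generic message is all we can offer.
    return ("Z", "syntax error: parser could not proceed")


for text in ["x = 1 ;", "x = 1", "x == 1 ;", "= x 1 ;"]:
    print(text, "->", check_statement(text.split()))
```

Every informative message corresponds to a production that somebody had to anticipate and write down, which is why designing M is the laborious part whether or not a generator is used.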

This is not just abstract silliness. There is a recent change in C++ that illustrates this distinction well. It used to be that

 template< class T > class X;
 template< class T > class Y;
 X<Y<int>> foo; // syntax in M

had an error on line three because the characters ">>" were the token for the right-shift operator. That line had to be written

 X<Y<int> > foo; // syntax in L

The standard was changed so that the extra space is no longer required. The reason was that all the major compilers had already written code to recognize this case in order to generate a meaningful error message. In other words, they found that language M was already implemented everywhere. Once the committee determined this, they moved the M-syntax into the new version of L.

We would have better language design if designers considered language M at the same time as they work on L. Just for their own sanity, they would make an effort to minimize the size of the specification for M, which would be good for everyone. Alas, the world is not there yet.

The upshot is that you have to design your own language M. That is the hard problem. Whether or not you use an automated tool for this is somewhat beside the point. It helps, but it does not get rid of the most laborious part.
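The ">>" case can be sketched as a tiny parser for nested template argument lists. This is a hypothetical illustration, not code from any real compiler: the lexer, the `parse_type` function, its `cxx11` flag and the wording of the error message are all assumptions.

```python
import re


class ParseError(Exception):
    pass


def lex(text):
    # Maximal munch: '>>' is tried before '>', so it arrives as one token.
    return re.findall(r">>|[<>]|\w+", text)


def parse_type(tokens, cxx11=True):
    """Parse NAME or NAME '<' type '>' and return a name or a (name, argument) pair."""
    if not tokens or not tokens[0].isidentifier():
        raise ParseError("expected a type name")
    name = tokens.pop(0)
    if not tokens or tokens[0] != "<":
        return name
    tokens.pop(0)  # consume '<'
    arg = parse_type(tokens, cxx11)
    if tokens and tokens[0] == ">>":
        if not cxx11:
            # Recognized-but-illegal input: the M \ L case with a helpful message.
            raise ParseError("'>>' should be written '> >' in a nested template argument list")
        tokens[0] = ">"  # M-syntax promoted into L: use one '>' now, leave the other
    elif tokens and tokens[0] == ">":
        tokens.pop(0)
    else:
        raise ParseError("expected '>'")
    return (name, arg)


print(parse_type(lex("X<Y<int>>")))
print(parse_type(lex("X<Y<int> >"), cxx11=False))
```

Both calls produce the same tree. With `cxx11=False` the first input is rejected, but the branch that detects it already knows exactly what the programmer meant, which is the observation that motivated the change to the standard.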

+3

eh9 Dec 03 '15 at 15:43

This is probably not what you want to hear, but you are best off writing the parser and lexer by hand.

It is not a particularly difficult task (especially compared to writing the semantic analyzer and code generator), and it gives the best results for error handling.

But don't take my word for it; trust Walter Bright, author of the first native C++ compiler and inventor of the D programming language.

He has an article about this on Dr. Dobb's (error recovery is on page 2).
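As a rough idea of what a hand-written parser buys you, here is a minimal recursive descent sketch with panic-mode recovery: each error is recorded with its syntactic context, then the parser skips to the next ';' and carries on. The toy grammar and the `Parser` class are assumptions for illustration, not taken from the article mentioned in this answer.

```python
class ParseError(Exception):
    pass


class Parser:
    """stmt := ID '=' operand ('+' operand)* ';'   operand := ID | NUM"""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.errors = []

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "<EOF>"

    def fail(self, expected, context):
        raise ParseError(
            f"token {self.pos}: expected {expected} in {context}, found '{self.peek()}'")

    def operand(self):
        tok = self.peek()
        if not (tok.isidentifier() or tok.isdigit()):
            self.fail("an operand", "expression")
        self.pos += 1

    def statement(self):
        if not self.peek().isidentifier():
            self.fail("a variable name", "statement")
        self.pos += 1
        if self.peek() != "=":
            self.fail("'='", "assignment")
        self.pos += 1
        self.operand()
        while self.peek() == "+":
            self.pos += 1
            self.operand()
        if self.peek() != ";":
            self.fail("';'", "statement")
        self.pos += 1

    def synchronize(self):
        # Panic mode: discard tokens up to and including the next ';'.
        while self.peek() not in (";", "<EOF>"):
            self.pos += 1
        if self.peek() == ";":
            self.pos += 1

    def program(self):
        while self.peek() != "<EOF>":
            try:
                self.statement()
            except ParseError as err:
                self.errors.append(str(err))
                self.synchronize()
        return self.errors


for message in Parser("x = 1 + ; y 2 ; z = 3 ;".split()).program():
    print(message)
```

Because each `fail` call sits inside the function for a specific construct, the message can name that construct directly, and the choice of ';' as the synchronization point is a decision the author of the parser makes by hand.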

+1

Computermatronic Dec 02 '15 at 11:58

rurban · Accepted Answer · 2015-12-06T09:31:44+0000

With a traditional YACC / bison generator you get yyerror / YYERROR, with which it is not easy to generate very helpful error messages, because of the unordered backtracking nature of LALR parsers. You can even add error recovery rules there, because you might need them to suppress wrong error messages in failed rules where you only wanted the parse to move on.

With a PEG-based parser you get a much better postfix error action block syntax, ~{}. See e.g. the peg manual.

  rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
       | ...

  rule = (e1 e2 e3) ~{ error("one of e[123] has failed"); }
       | ...

You get excellent error messages at the actual location of the error. But you have to write PEG rules, which are not so easy to write, especially when handling operator precedence. This is easier with an LALR parser.

With a simpler recursive descent parser generator you get the same benefits as PEG error messages, but with much slower parsing speed.
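The idea behind the postfix error action can be imitated with a few parser combinators. This sketch is not the peg tool and does not use its API; `lit`, `seq`, `on_fail` and the sample rule are hypothetical names, and `on_fail(e, action)` only mimics the behaviour of `e ~{ action }`, running the action when `e` fails.

```python
def lit(word):
    """Match one specific token; return the new position or None on failure."""
    def parse(tokens, pos):
        return pos + 1 if pos < len(tokens) and tokens[pos] == word else None
    return parse


def number(tokens, pos):
    return pos + 1 if pos < len(tokens) and tokens[pos].isdigit() else None


def seq(*parsers):
    """Match all parsers in order; fail as soon as one of them fails."""
    def parse(tokens, pos):
        for parser in parsers:
            pos = parser(tokens, pos)
            if pos is None:
                return None
        return pos
    return parse


def on_fail(parser, action):
    """Imitation of  e ~{ action } : call action(pos) only when e fails at pos."""
    def parse(tokens, pos):
        result = parser(tokens, pos)
        if result is None:
            action(pos)
        return result
    return parse


errors = []

# rule = "let" "x" "=" number ~{ error(...) }
rule = seq(
    lit("let"), lit("x"), lit("="),
    on_fail(number, lambda pos: errors.append(f"expected a number at token {pos}")),
)

print(rule("let x = 5".split(), 0), errors)
print(rule("let x = y".split(), 0), errors)
```

Attaching the action to the last element reports only failures of that element, while wrapping the whole `seq(...)` in `on_fail` would report a failure of any element, mirroring the two variants of the rule shown above.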

See the same discussion at http://lambda-the-ultimate.org/node/4781
