How to create an AST parser that recovers from syntax errors?

First, what should I read about parsing and building an AST?

How do I create a parser for a language (for example, SQL) that builds an AST and recovers from syntax errors?

For example, for "3 + 4 * 5":

  +
 / \
3   *
   / \
  4   5

And for "3 + 4 * +" with a syntax error, the analyzer guesses what the user meant:

  +
 / \
3   *
   / \
  4   +
     / \
    ?   ?

Where to begin?

SQL:

                 SELECT
        _______/   |   \________
       /           |            \
      .           FROM          JOIN
     / \           |           /    \
    a   city_name people  address    ON
                                      |
                                      =
                               ______/ \______
                              /               \
                             .                 .
                            / \               / \
                           p   address_id    a   id
+5
2 answers

The standard answer to the question of how to build parsers (that construct ASTs) is to read the standard compiler texts. Aho and Ullman's "Dragon" compiler book is pretty classic. If you don't have the patience to work through the best reference material, you will have more trouble, because it provides the theory and explores the subtleties. But here is my answer for people in a hurry: build recursive descent parsers.
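
As a concrete starting point, here is a minimal sketch of that approach in Python, parsing the question's expression grammar into an AST; the class and function names are my own, not from any particular library:

    import re

    def tokenize(text):
        # Numbers plus the operators and parentheses used in the example grammar.
        return re.findall(r"\d+|[+*()]", text)

    class Node:
        def __init__(self, value, children=()):
            self.value, self.children = value, list(children)
        def __repr__(self):
            if not self.children:
                return self.value
            return "(" + self.value + " " + " ".join(map(repr, self.children)) + ")"

    class Parser:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0
        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None
        def next(self):
            tok = self.peek()
            self.pos += 1
            return tok

        def parse_expr(self):         # expr   := term ('+' term)*
            node = self.parse_term()
            while self.peek() == "+":
                self.next()
                node = Node("+", [node, self.parse_term()])
            return node

        def parse_term(self):         # term   := factor ('*' factor)*
            node = self.parse_factor()
            while self.peek() == "*":
                self.next()
                node = Node("*", [node, self.parse_factor()])
            return node

        def parse_factor(self):       # factor := NUMBER | '(' expr ')'
            tok = self.next()
            if tok == "(":
                node = self.parse_expr()
                if self.next() != ")":
                    raise SyntaxError("expected ')'")
                return node
            if tok is None or not tok.isdigit():
                raise SyntaxError(f"expected a number, got {tok!r}")
            return Node(tok)

    print(Parser(tokenize("3 + 4 * 5")).parse_expr())   # prints (+ 3 (* 4 5))

Each grammar rule becomes one function, and each function returns the subtree it recognized, which is exactly the tree shown in the question.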

You can build parsers with built-in error recovery. There are many papers on this subject; it was a hot topic in the 1980s. Check Google Scholar for "syntax error repair". The basic idea is that the parser, on encountering a parse error, either skips to some well-known beacon (";", the statement delimiter, is quite popular for C-like languages, which is why you were asked in a comment whether your language has statement terminators), or proposes various deletions from or insertions into the input stream to climb over the point of the syntax error. The sheer variety of such schemes is amazing. The key idea is to take into account as much information around the point of the error as possible. One of the most intriguing ideas I have ever seen used two parsers, one running N tokens ahead of the other looking for syntax-error land mines, and the second parser being fed error repairs based on the N tokens still available to it before it reaches the syntax error. This lets the second parser choose to act differently before it arrives at the syntax error. Without this, most parsers throw away the left context and thereby lose the ability to repair. (I have never implemented such a scheme.)
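
The simplest member of that family is panic-mode recovery. Here is a rough sketch, assuming a parser object with peek(), next() and parse_statement() methods; the names are illustrative:

    def parse_statement_list(parser):
        # Parse statements one by one; on a syntax error, record it and skip
        # ahead to the next ';' beacon so the rest of the input is still parsed.
        statements, errors = [], []
        while parser.peek() is not None:
            try:
                statements.append(parser.parse_statement())
            except SyntaxError as err:
                errors.append(str(err))
                while parser.peek() not in (";", None):
                    parser.next()          # discard tokens up to the beacon
                if parser.peek() == ";":
                    parser.next()          # step over the beacon itself
        return statements, errors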

The choice of tokens to insert can often be derived from the information used to build the parser in the first place (often the First and Follow sets). This is relatively easy to do with L(AL)R parsers, because the parse tables contain the necessary information and are available to the parser at the point where it encounters an error. If you want to understand how to do this, you need to understand the theory (there's that compiler book again) of how parsers are built. (I have successfully implemented this scheme several times.)
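
For illustration, here is a small sketch of computing FIRST sets for a toy version of the expression grammar; the grammar encoding is my own, with an empty production standing for the empty string:

    def first_sets(grammar):
        # grammar maps each nonterminal to a list of productions,
        # each production being a list of symbols; "" marks "can derive empty".
        first = {nt: set() for nt in grammar}
        changed = True
        while changed:
            changed = False
            for nt, productions in grammar.items():
                for prod in productions:
                    for symbol in prod:
                        if symbol in grammar:             # nonterminal
                            new = first[symbol] - {""}
                            if not new <= first[nt]:
                                first[nt] |= new
                                changed = True
                            if "" not in first[symbol]:
                                break
                        else:                             # terminal
                            if symbol not in first[nt]:
                                first[nt].add(symbol)
                                changed = True
                            break
                    else:
                        if "" not in first[nt]:           # every symbol can vanish
                            first[nt].add("")
                            changed = True
        return first

    grammar = {
        "expr":   [["term", "expr2"]],
        "expr2":  [["+", "term", "expr2"], []],
        "term":   [["factor", "term2"]],
        "term2":  [["*", "factor", "term2"], []],
        "factor": [["num"], ["(", "expr", ")"]],
    }
    print(first_sets(grammar))   # FIRST(expr) comes out as {'num', '('}

At an error point, the set of tokens the parser could have accepted there (derived from these sets, or read directly off the LR tables) is a natural candidate list for insertion repairs.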

No matter what you do, syntax error repair does not help much, because it is almost impossible to guess what the writer of the parsed document actually intended. This suggests that fancy schemes will not really be that useful. I stick to simple ones; people are happy to get an error report and some semi-graceful continuation of parsing.

The real problem with rolling your own parser for a real language is that real languages are nasty, messy things; people building real implementations get them wrong, and the mistakes get frozen into existing code bases, or they insist on bending/"improving" the language (standards are for weaklings, goodies are for marketing) because it's cool. Expect to spend a lot of time recalibrating what you think the grammar is against the ground truth of real code. As a general rule, if you want a working parser, it is better to get one with a track record than to roll your own.

A lesson most people who struggle through building a parser don't get is that if they want to do anything useful with the parse result or tree, they will need a lot more basic machinery than just the parser. Check my bio for "Life After Parsing".

+3

There are two things a parser could do:

  • Report the error and ask the user to try again.
  • Correct the error and continue.

Generally speaking, the first is easier (and safer). There may not always be enough information for the parser to infer intent when the syntax is wrong. Depending on the circumstances, it can be dangerous to proceed with a repair that makes the input syntactically correct but semantically wrong.

I have written a few hand-written recursive descent parsers for little languages. When you write the code that interprets the grammar rules explicitly (as opposed to using a parser generator), it is easy to detect errors, because the next token does not fit the production rule. Generated parsers tend to emit a simplistic "expected $(TOKEN_TYPE) here" message, which is not always useful to the user. With a hand-written parser, it is often easier to give a more specific diagnostic message, but it can be time-consuming to cover every case.
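
As a purely illustrative contrast between the two kinds of messages (both strings are made up):

    def generic_message(expected, got):
        # What a generated parser typically reports.
        return f"expected {expected} here, got {got!r}"

    def specific_message(got):
        # What a hand-written parser can afford to say for this particular case.
        return (f"the operator {got!r} has nothing on its right-hand side; "
                f"did you forget an operand?")

    print(generic_message("NUMBER", "+"))
    print(specific_message("+"))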

If your goal is to report the problem but continue parsing (so that you can see whether there are additional problems), you can put a special AST node in the tree at the point of the error. This keeps the tree from falling apart.
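
One way to sketch that, with an explicit error node class; the class and field names are my own invention:

    class ErrorNode:
        # Placeholder AST node recorded where the syntax error occurred, so the
        # surrounding tree stays intact and can still be traversed or reported on.
        def __init__(self, message, skipped=()):
            self.message = message
            self.skipped = list(skipped)   # raw tokens discarded during recovery
        def __repr__(self):
            return f"<error: {self.message}>"

    def parse_factor_or_error(parser):
        try:
            return parser.parse_factor()
        except SyntaxError as err:
            return ErrorNode(str(err))

If the expression parser sketched in the first answer called this wrapper instead of parse_factor directly, the question's input "3 + 4 * +" would still come back as a complete tree, with an ErrorNode sitting where the missing operand was expected.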

You then need to resynchronize to some point beyond the error in order to continue parsing. As Ira Baxter mentioned in his answer, you can look for a token, such as ';', that separates statements. The right token(s) to look for depend on the language you are parsing. Another possibility is to guess what the user meant (for example, infer an extra token or a different token at the point where the error was detected) and then continue. If you encounter another syntax error within the next few tokens, you can backtrack, make a different guess, and try again.
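
A crude sketch of that guess-and-retry idea: attempt the parse as given, and if it fails, retry with a single plausible token spliced into the input, keeping the first guess that lets the whole parse succeed. The candidate list and function names are illustrative:

    def parse_with_repair(tokens, parse_all, candidates=("0", ")", ";")):
        # parse_all(tokens) must consume the whole token list or raise SyntaxError.
        try:
            return parse_all(tokens), None
        except SyntaxError as original_error:
            for guess in candidates:
                for pos in range(len(tokens), -1, -1):   # try later positions first
                    patched = list(tokens[:pos]) + [guess] + list(tokens[pos:])
                    try:
                        tree = parse_all(patched)
                        return tree, f"repaired by inserting {guess!r} at token {pos}"
                    except SyntaxError:
                        continue
            raise original_error   # no single-token insertion helped

Real repair schemes are smarter about where they try the insertion (usually at or near the reported error position), but the backtrack-and-try-again shape is the same.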

+1

Source: https://habr.com/ru/post/1203474/

