Grammar Rules for Comments

I am working with reflect.js (a nice JavaScript parser) by Zach Carter on GitHub, and I am trying to change the behavior of the parser so that comments are processed like regular tokens and parsed like everything else. The default behavior is to track all comments (the lexer captures them as tokens) and then append a list of them to the end of the generated AST (abstract syntax tree).

However, I would like these comments to be included in place in the AST. I believe this change would involve adding grammar rules to the grammar.y file. There are currently no rules for comments; if my understanding is correct, that is why they are ignored by the main parsing code.

How do you write the rules for including comments in the AST?

1 answer

The naive version changes each rule of the original grammar:

LHS = RHS1 RHS2 ... RHSN ; 

into:

  LHS = RHS1 COMMENTS RHS2 COMMENTS ... COMMENTS RHSN ; 
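As a jison-style sketch (hypothetical rule and token names, not from reflect.js's actual grammar.y), the transformation would look like this, with a `COMMENTS` nonterminal matching zero or more comment tokens between every pair of symbols:

```yacc
/* Hypothetical fragment: every rule is interleaved with COMMENTS. */
AdditiveExpression
    : MultiplicativeExpression
    | AdditiveExpression COMMENTS '+' COMMENTS MultiplicativeExpression
    ;

COMMENTS
    : /* empty */
    | COMMENTS COMMENT
    ;
```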

Although this works in the abstract, it will most likely wreck your parser if the generator is LL- or LALR-based, because the parser can no longer decide what to do by looking only at the next token. You would therefore have to switch to a more powerful parsing technology such as GLR.

A smarter version replaces only each terminal T with a nonterminal:

  T = COMMENTS t ; 

and modifies the original lexer to trivially emit t instead of T. You still have the same lookahead problem.
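As a grammar sketch (hypothetical token names; lowercase `plus` is the raw token the modified lexer now emits in place of the old `PLUS` terminal):

```yacc
/* Each former terminal becomes a nonterminal absorbing leading comments. */
PLUS
    : COMMENTS plus
    ;

COMMENTS
    : /* empty */
    | COMMENTS COMMENT
    ;
```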

But this gives us the basis for a real solution.

A more sophisticated version has the lexer collect the comments that appear in front of a token and attach them to the next token it emits; in essence, we implement the modified terminal rule inside the lexer.
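A minimal sketch of this idea, assuming a hypothetical token shape (`{ type, value }` objects) rather than reflect.js's actual lexer API: a wrapper buffers comment tokens and hangs them on the next real token as a `leadingComments` annotation.

```javascript
// Wrap an underlying lexer so comment tokens are buffered and attached
// to the next non-comment token instead of being emitted themselves.
// (Hypothetical API sketch, not reflect.js's real lexer interface.)
function commentAttachingLexer(nextRawToken) {
  let pending = [];                    // comments seen since the last real token
  return function next() {
    for (;;) {
      const tok = nextRawToken();
      if (tok === null) return null;   // end of input
      if (tok.type === "Comment") {
        pending.push(tok);             // buffer, don't emit
      } else {
        tok.leadingComments = pending; // annotate the real token
        pending = [];
        return tok;
      }
    }
  };
}

// Usage with a toy raw-token stream:
const raw = [
  { type: "Comment", value: "// add" },
  { type: "Identifier", value: "x" },
  { type: "Punctuator", value: "+" },
  { type: "Identifier", value: "y" },
];
let i = 0;
const next = commentAttachingLexer(() => (i < raw.length ? raw[i++] : null));
const first = next(); // Identifier "x", carrying the comment as annotation
```

The parser downstream never sees a comment token, so the grammar needs no changes at all.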

Now the parser (you don't need to switch technologies) sees exactly the token stream it saw originally; the tokens simply carry the comments along as annotations. You will probably want to distinguish comments attached to the previous token from comments attached to the next one, but you cannot do better than a heuristic there, because there is no practical way to decide which token a comment really belongs to.
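One common heuristic (a sketch under the assumption that the lexer records line numbers on tokens and comments): a comment starting on the same line as the previous token trails that token; a comment on its own line leads whatever follows.

```javascript
// Hypothetical heuristic for splitting comments into "trailing the
// previous token" vs. "leading the next token", based on line numbers.
function classifyComment(comment, prevToken) {
  if (prevToken && comment.line === prevToken.line) {
    return "trailing"; // e.g.  x = 1; // same-line remark
  }
  return "leading";    // comment on its own line belongs to what follows
}

const prev = { type: "Punctuator", value: ";", line: 3 };
classifyComment({ value: "// same-line remark", line: 3 }, prev); // "trailing"
classifyComment({ value: "// next stmt docs", line: 4 }, prev);   // "leading"
```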

You will want to capture positioning information on the tokens and comments in order to enable source-text regeneration ("comments in the right places"). You will find it even more fun to actually regenerate the text with the correct radix values, character-string escapes, etc., so as not to violate the syntax rules of the language.

We do this with our general language-processing tools, and it works quite well. It is amazing how much work it takes to get all of this straight so that you can focus on your transformation task; people underestimate it a lot.


Source: https://habr.com/ru/post/1490806/
