Left factor grammar of coffeescript expressions

I am writing an Antlr / Xtext parser for coffeescript grammar . This is at the beginning yet, I just moved a subset of the original grammar , and I'm stuck in the expressions. This scary rule expression has an error other than LL (*). Here I found some related questions, Help with left grammar factorization to remove left recursion and ANTLR Grammar for expressions , I also tried How to remove global rollback from your grammar , but it just shows a very simple case that I cannot use in real life. The ANTLR Grammar Tip: LL () post and left factoring gave me more info, but I still can't handle it.

My question is how to fix the following grammar (sorry, I could not simplify it and still keep the error). I think the problem is with the term rule, so I would appreciate fixing it locally instead of changing it all (I try to stay close to the rules of the original grammar). Pointers can also advise how to "debug" this erroneous grammar in your head.

 grammar CoffeeScript; options { output=AST; } tokens { AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE; } COMPARE : '<' | '==' | '>'; COMPOUND_ASSIGN : '+=' | '-='; EQUAL : '='; LOGIC : '&&' | '||'; LPAREN : '('; MATH : '*' | '/'; MINUS : '-'; MINUS_MINUS : '--'; NEW : 'new'; NUMBER : ('0'..'9')+; PLUS : '+'; PLUS_PLUS : '++'; QUESTION : '?'; RELATION : 'in' | 'of' | 'instanceof'; RPAREN : ')'; SHIFT : '<<' | '>>'; STRING : '"' (('a'..'z') | ' ')* '"'; TERMINATOR : '\n'; UNARY : '!' | '~' | NEW; // Put it at the end, so keywords will be matched earlier IDENTIFIER : ('a'..'z' | 'A'..'Z')+; WS : (' ')+ {skip();} ; root : body ; body : line ; line : expression ; assign : assignable EQUAL expression ; expression : value | assign | operation ; identifier : IDENTIFIER ; simpleAssignable : identifier ; assignable : simpleAssignable ; value : assignable | literal | parenthetical ; literal : alphaNumeric ; alphaNumeric : NUMBER | STRING; parenthetical : LPAREN body RPAREN ; // term should be the same as expression except operation to avoid left-recursion term : value | assign ; questionOp : term QUESTION? ; mathOp : questionOp (MATH questionOp)* ; additiveOp : mathOp ((PLUS | MINUS) mathOp)* ; shiftOp : additiveOp (SHIFT additiveOp)* ; relationOp : shiftOp (RELATION shiftOp)* ; compareOp : relationOp (COMPARE relationOp)* ; logicOp : compareOp (LOGIC compareOp)* ; operation : UNARY expression | MINUS expression | PLUS expression | MINUS_MINUS simpleAssignable | PLUS_PLUS simpleAssignable | simpleAssignable PLUS_PLUS | simpleAssignable MINUS_MINUS | simpleAssignable COMPOUND_ASSIGN expression | logicOp ; 

UPDATE: The final solution will use Xtext with an external lexer to avoid the intricacies of handling significant spaces . Here is a snippet of my version of Xtext:

 CompareOp returns Operation: AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*; 

My strategy is to make Antlr a working analyzer first without using AST. (Well, this deserves a separate question if this is an acceptable approach.) So I don’t care about tokens at the moment, they are included to make development easier.

I know that the original grammar is LR. I do not know how close I can stay on it by going to LL.

UPDATE2 and SOLUTION: I could simplify my problem with the considerations received from Bart's answer. Here is a working grammar for handling simple expressions with function calls to illustrate it. The comment before expression shows my understanding.

 grammar FunExp; ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*; NUMBER: '0'..'9'+; WS: (' ')+ {skip();}; root : expression ; // atom and functionCall would go here, // but they are reachable via operation -> term // so they are omitted here expression : operation ; atom : NUMBER | ID ; functionCall : ID '(' expression (',' expression)* ')' ; operation : multiOp ; multiOp : additiveOp (('*' | '/') additiveOp)* ; additiveOp : term (('+' | '-') term)* ; term : atom | functionCall | '(' expression ')' ; 
+4
source share
1 answer

When you create a lexer and parser from your grammar, the following error is displayed on your console:

error (211): CoffeeScript.g: 52: 3: [fatal] the rule expression has a non-LL (*) solution due to recursive calls to the rules available from alts 1,3. Solve using left factoring or using syntactic predicates or using the backtrack = true parameter.

warning (200): CoffeeScript.g: 52: 3: The solution can match the input, for example, "{NUMBER, STRING}", using several alternatives: 1, 3

As a result, alternative 3 was disabled for this input.

(I emphasized the important bits)

This is only the first error, but you start with the first and get a little luck, errors below this first will also disappear if you correct the first.

The above error means that when you try to parse either NUMBER or STRING with the parser generated from your grammar, the parser can go in two ways when it ends in the expression rule

  expression
   : value // choice 1
   |  assign // choice 2
   |  operation // choice 3
   ; 

Namely, choice 1 and choice 3 can both parse a NUMBER or STRING , as you can see from the β€œpaths” that the parser can follow to correspond to these two options:

choice 1:

  expression
   value
     literal
       alphaNumeric: {NUMBER, STRING} 

choice 3:

  expression
   operation
     logicOp
       relationOp
         shiftOp
           additiveOp
             mathOp
               questionOp
                 term
                   value
                     literal
                       alphaNumeric: {NUMBER, STRING} 

In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either NUMBER or STRING parsed, forcing choice 1 to match that input (since it is defined before choice 3).

Thus, either CoffeeScript grammar is ambiguous in this regard (and somehow removes this ambiguity), or your implementation is wrong (I assume the latter :)). You need to correct this ambiguity in your grammar: i. Do not allow the choice of expression 1 and 3 to match the same input.


I noticed 3 more things in your grammar:

1

Take the following lexer rules:

  NEW: 'new';
 ...
 UNARY: '!'  |  '~' |  NEW 

Remember that the UNARY token UNARY never match the text 'new' , since the NEW token is defined before it. If you want to enable UNARY macth, remove the NEW rule and run:

  UNARY: '!'  |  '~' |  'new'; 

2

In cases where you collect several types of tokens in one, for example LOGIC :

  LOGIC: '&&' |  '||'; 

and then you use this token in the parser rules, for example:

  logicOp
   : compareOp (LOGIC compareOp) *
   ; 

But if you are going to evaluate such an expression at a later stage, you do not know what corresponds to this LOGIC token ( '&&' or '||' ), and you will need to check the internal text token to find out. You better do something like this (at least if you do some kind of evaluation at a later stage):

  AND: '&&';
 OR: '||';

 ...

 logicOp
   : compareOp (AND compareOp // easier to evaluate, you know it an AND expression
               |  OR compareOp // easier to evaluate, you know it an OR expression
               ) *
   ; 

3

Are you missing spaces (and no tabs?) With:

  WS: ('') + {skip ();}; 

but does CoffeeScript code make it a code block with spaces (and tabs), like Python? But maybe you will do it at a later stage?


I just saw that the grammar you are looking at is a jison grammar (which more or less represents an implementation of the bison in JavaScript). But the bison, and hence jison, generates LR parsers , while ANTLR generates LL parsers . Therefore, an attempt to approach the rules of the original grammar will only lead to big problems.

+4
source

Source: https://habr.com/ru/post/1382979/


All Articles