ANTLR parses greedily, even though it could match the higher-priority rule

I use the following ANTLR grammar to define a function.

    definition_function
        : DEFINE FUNCTION function_name '[' language_name ']' RETURN attribute_type '{' function_body '}'
        ;

    function_name : id ;

    language_name : id ;

    function_body : SCRIPT ;

    SCRIPT
        : '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}'
          { setText(getText().substring(1, getText().length()-1)); }
        ;

But when I try to parse two functions as below,

    define function concat[Scala] return string {
      var concatenatedString = ""
      for(i <- 0 until data.length) {
         concatenatedString += data(i).toString
      }
      concatenatedString
    };
    define function concat[JavaScript] return string {
      var str1 = data[0];
      var str2 = data[1];
      var str3 = data[2];
      var res = str1.concat(str2,str3);
      return res;
    };

ANTLR does not parse this as two function definitions, but as a single function with the following body:

      var concatenatedString = ""
      for(i <- 0 until data.length) {
         concatenatedString += data(i).toString
      }
      concatenatedString
    };
    define function concat[JavaScript] return string {
      var str1 = data[0];
      var str2 = data[1];
      var str3 = data[2];
      var res = str1.concat(str2,str3);
      return res;

Can you explain this behavior? The function body can contain anything. How should I define this grammar?

2 answers

Your rule matches that much because the range '\u0020'..'\u007e' in '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}' itself matches both { and }, so the lexer keeps consuming input up to the last } it can find.
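To see the effect outside ANTLR, here is a minimal Python sketch. It assumes that a greedy regular expression is a fair stand-in for the lexer's longest-match behavior; the names `GREEDY_SCRIPT`, `first_script` and `sample` are made up for this illustration and are not part of the grammar.

```python
import re

# Rough regex analogue of the original SCRIPT rule (an approximation, not ANTLR).
# The second class also excludes \x20-\x7e, only to keep the two alternatives
# disjoint; the set of characters accepted is the same as in the grammar rule.
GREEDY_SCRIPT = re.compile(r"\{(?:[\x20-\x7e]|[^{}\x20-\x7e])*\}")


def first_script(text):
    """Return the first SCRIPT-like match in text, or None if there is none."""
    match = GREEDY_SCRIPT.search(text)
    return match.group(0) if match else None


sample = (
    "define function a[X] return string { f() }; "
    "define function b[Y] return string { g() };"
)

# Both braces fall inside the range \u0020..\u007e.
print(0x20 <= ord("{") <= 0x7E, 0x20 <= ord("}") <= 0x7E)

# The match runs from the first '{' to the last '}', swallowing both functions.
print(first_script(sample))
```

Because the braces are ordinary members of the range, nothing stops the loop at the first closing brace.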

Your rule should work if you define it like this:

 SCRIPT : '{' ( SCRIPT | ~( '{' | '}' ) )* '}' ; 

However, this will fail if the script block contains, say, strings or comments containing { or }. Here is a way to match a SCRIPT token, including comments and string literals that may contain { and }:

    SCRIPT
        : '{' SCRIPT_ATOM* '}'
        ;

    fragment SCRIPT_ATOM
        : ~[{}]
        | '"' ~["]* '"'
        | '//' ~[\r\n]*
        | SCRIPT
        ;
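The same idea can be sketched by hand in Python: count brace depth, but step over string literals and line comments first. This is only a rough analogue of the SCRIPT_ATOM alternatives, not a description of how the ANTLR lexer works internally, and `scan_script` is a hypothetical helper name.

```python
def scan_script(text, start=0):
    """Return the index just past the '}' that closes the '{' at text[start]."""
    if text[start] != "{":
        raise ValueError("expected '{' at start")
    depth = 0
    i = start
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # String literal: skip to the closing quote, like '"' ~["]* '"'
            # (no escape sequences, same as the grammar rule).
            close = text.find('"', i + 1)
            if close == -1:
                raise ValueError("unterminated string literal")
            i = close + 1
            continue
        if text.startswith("//", i):
            # Line comment: skip to the end of the line, like '//' ~[\r\n]*
            while i < n and text[i] not in "\r\n":
                i += 1
            continue
        if ch == "{":
            depth += 1  # nested block opens, like the recursive SCRIPT alternative
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced braces")


source = 'x { for (;;) { while (true) { } } var s = "}" // }\n return s };'
begin = source.index("{")
end = scan_script(source, begin)
print(source[begin:end])  # the whole outer block, without the trailing ';'
```

The braces inside the string and the comment never reach the depth counter, which is what the extra SCRIPT_ATOM alternatives achieve in the grammar.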

A complete grammar that parses your input correctly will look like this:

    grammar T;

    parse
        : definition_function* EOF
        ;

    definition_function
        : DEFINE FUNCTION function_name '[' language_name ']' RETURN attribute_type SCRIPT ';'
        ;

    function_name : ID ;
    language_name : ID ;
    attribute_type : ID ;

    DEFINE : 'define' ;
    FUNCTION : 'function' ;
    RETURN : 'return' ;
    ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
    SCRIPT : '{' SCRIPT_ATOM* '}' ;
    SPACES : [ \t\r\n]+ -> skip ;

    fragment SCRIPT_ATOM
        : ~[{}]
        | '"' ~["]* '"'
        | '//' ~[\r\n]*
        | SCRIPT
        ;

which also parses the following input correctly:

    define function concat[JavaScript] return string {
      for (;;) {
        while (true) { }
      }
      var s = "}" // }
      return s
    };

If you do not need SCRIPT to be a token (recognized by a lexer rule), you can use a parser rule that recognizes nested blocks (the block rule below). The grammar included here should parse your example as two separate function definitions.

    DEFINE : 'define';
    FUNCTION : 'function';
    RETURN : 'return';
    ID : [A-Za-z]+;
    WS : [ \r\t\n]+ -> skip ;
    // Catch-all. Listed after WS: on equal-length matches the earlier rule wins,
    // so a single space would otherwise become an ANY token instead of being skipped.
    ANY : . ;

    test : definition_function* ;

    definition_function
        : DEFINE FUNCTION function_name '[' language_name ']' RETURN attribute_type block ';'
        ;

    function_name : id ;
    language_name : id ;
    attribute_type : 'string' ;
    id : ID;

    block
        : '{' ( ( ~('{'|'}') )+ | block )* '}'
        ;
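To illustrate how the recursive block rule separates nested braces from the closing brace of the function, here is a small recursive-descent sketch in Python. It assumes a simplified tokenizer (runs of letters, or any single non-space character) in place of the lexer rules above; `tokenize` and `parse_block` are names invented for this example.

```python
import re

# Simplified stand-in for the lexer: an ID-like run of letters, or any single
# non-whitespace character (similar in spirit to ANY). Whitespace is dropped.
TOKEN = re.compile(r"[A-Za-z]+|\S")


def tokenize(text):
    return TOKEN.findall(text)


def parse_block(tokens, pos):
    """block : '{' ( non-brace tokens | block )* '}' ; returns (tree, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "{":
        raise SyntaxError("expected '{'")
    pos += 1
    children = []
    while pos < len(tokens) and tokens[pos] != "}":
        if tokens[pos] == "{":
            # A nested block: recurse, exactly as the rule refers to itself.
            child, pos = parse_block(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])
            pos += 1
    if pos >= len(tokens):
        raise SyntaxError("missing '}'")
    return children, pos + 1  # step past the closing '}'


tokens = tokenize("{ a { b } c } ; { d }")
tree, next_pos = parse_block(tokens, 0)
print(tree)              # nested list mirroring the brace structure
print(tokens[next_pos])  # the token right after the first block
```

Each call returns as soon as its own closing brace is reached, so the first block ends before the `;` and the second definition is left for the next call.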

Source: https://habr.com/ru/post/982424/

