ANTLR - problem with setting up AST hierarchy

In ANTLR, I am trying to work with the tree-building operators (^ and !).

I have a grammar for flexible byte arrays (a UINT16 giving the number of bytes in the array, followed by that many bytes). I have commented out all the semantic predicates and their associated code, which verify that the array contains as many bytes as the first two bytes indicate ... this part is not what I'm having problems with.

My problem is with the tree that is created after parsing some input. Every character simply ends up as a sibling of every other node, i.e. the tree is flat. I expected the generated AST to look like the tree you can see in the ANTLRWorks 1.4 Interpreter window. As soon as I try to change how the tree is built using the ^ symbol, I get an exception of the form:

Unhandled Exception: System.SystemException: more than one node as root (TODO: make exception hierarchy) 

Here's the grammar (currently targeting C#):

    grammar FlexByteArray_HexGrammar;

    options {
      //language = 'Java';
      language = 'CSharp2';
      output = AST;
    }

    expr
      : array_length remaining_data
        // the amount of remaining data must equal array_length (times 2, since there are 2 hex characters per byte)
        // need to check if the array length is zero first to avoid checking $remaining_data.text (null reference) in that situation
        //{ ($array_length.value == 0 && $remaining_data.text == null) || ($remaining_data.text != null && $array_length.value*2 == $remaining_data.text.Length) }?
      ;

    array_length //returns [UInt16 value]
      : uint16_little //{ $value = $uint16_little.value; }
      ;

    hex_byte1 // needed just so I can distinguish between the two bytes of a uint16 when writing a semantic predicate (or whatever you call it when you write target-language code in curly brackets)
      : hex_byte
      ;

    uint16_big //returns [UInt16 value]
      : hex_byte1 hex_byte //{ $value = Convert.ToUInt16($hex_byte.text + $hex_byte1.text); }
      ;

    uint16_little //returns [UInt16 value]
      : hex_byte1 hex_byte //{ $value = Convert.ToUInt16($hex_byte1.text + $hex_byte.text); }
      ;

    remaining_data
      : hex_byte*
      ;

    hex_byte
      : HEX_DIGIT HEX_DIGIT
      ;

    HEX_DIGIT
      : ('0'..'9'|'a'..'f'|'A'..'F')
      ;

Here is what I thought the AST would be:

(image: ANTLRWorks 1.4 interpreter output for the input "0001ff")

Here's the C# program that I used to get a view of the AST (actually textual output, which I then ran through GraphViz to get an image):

    namespace FlexByteArray_Hex
    {
        using System;
        using Antlr.Runtime;
        using Antlr.Runtime.Tree;
        using Antlr.Utility.Tree;

        public class Program
        {
            public static void Main(string[] args)
            {
                ICharStream input = new ANTLRStringStream("0001ff");
                FlexByteArray_HexGrammarLexer lex = new FlexByteArray_HexGrammarLexer(input);
                CommonTokenStream tokens = new CommonTokenStream(lex);
                FlexByteArray_HexGrammarParser parser = new FlexByteArray_HexGrammarParser(tokens);
                Console.WriteLine("Parser created.");
                CommonTree tree = parser.expr().Tree as CommonTree;
                Console.WriteLine("------Input parsed-------");
                if (tree == null)
                {
                    Console.WriteLine("Tree is null.");
                }
                else
                {
                    DOTTreeGenerator treegen = new DOTTreeGenerator();
                    Console.WriteLine(treegen.ToDOT(tree));
                }
            }
        }
    }

Here is what the output of this program looks like when run through GraphViz: (image: GraphViz output)

The same program in Java (in case you want to try it without using C#):

    import org.antlr.*;
    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;

    public class Program {
        public static void main(String[] args) throws Exception {
            FlexByteArray_HexGrammarLexer lex = new FlexByteArray_HexGrammarLexer(new ANTLRStringStream("0001ff"));
            CommonTokenStream tokens = new CommonTokenStream(lex);
            FlexByteArray_HexGrammarParser parser = new FlexByteArray_HexGrammarParser(tokens);
            System.out.println("Parser created.");
            CommonTree tree = (CommonTree)parser.expr().tree;
            System.out.println("------Input parsed-------");
            if (tree == null) {
                System.out.println("Tree is null.");
            } else {
                DOTTreeGenerator treegen = new DOTTreeGenerator();
                System.out.println(treegen.toDOT(tree));
            }
        }
    }
1 answer

Anssssss wrote:

As soon as I try to change how the tree is built using the ^ symbol, I get an exception of the form: ...

Suppose you try to make the parser rule a the root of the tree inside p, like this:

    p : a^ b;
    a : AA;
    b : BB;

Then ANTLR does not know which token from rule a should become the root: a rule may match more than one token, and of course there cannot be two roots.

Inline tree operators are convenient in some cases, but here they do not fit the task. You cannot make something the root when it comes from a rule that may match no content at all, such as your remaining_data rule. In this case, you need to create "imaginary tokens" in a tokens { ... } section of your grammar and use rewrite rules ( -> ^( ... ) ) to build your AST.

Demo

The following grammar:

    grammar FlexByteArray_HexGrammar;

    options {
      output=AST;
    }

    tokens {
      ROOT;
      ARRAY;
      LENGTH;
      DATA;
    }

    expr
      : array* EOF -> ^(ROOT array*)
      ;

    array
    @init { int index = 0; }
      : array_length array_data[$array_length.value] -> ^(ARRAY array_length array_data)
      ;

    array_length returns [int value]
      : a=hex_byte b=hex_byte {$value = $a.value*16*16 + $b.value;} -> ^(LENGTH hex_byte hex_byte)
      ;

    array_data [int length]
      : ({length > 0}?=> hex_byte {length--;})* {length == 0}? -> ^(DATA hex_byte*)
      ;

    hex_byte returns [int value]
      : a=HEX_DIGIT b=HEX_DIGIT {$value = Integer.parseInt($a.text+$b.text, 16);}
      ;

    HEX_DIGIT
      : '0'..'9' | 'a'..'f' | 'A'..'F'
      ;
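To see concretely what the embedded actions in hex_byte and array_length compute, here is a plain-Java sketch of the same arithmetic. The class and method names are mine for illustration; this is not part of the generated parser.

```java
// Sketch: plain-Java equivalents of the embedded actions in the grammar above.
public class HexValues {

    // hex_byte: two hex digits form one byte value, e.g. "f","f" -> 255
    static int hexByte(String a, String b) {
        return Integer.parseInt(a + b, 16);
    }

    // array_length: the first byte is the high byte: $a.value*16*16 + $b.value
    static int arrayLength(int a, int b) {
        return a * 16 * 16 + b;
    }

    public static void main(String[] args) {
        System.out.println(hexByte("f", "f"));                                 // 255
        System.out.println(arrayLength(hexByte("0", "0"), hexByte("0", "1"))); // 1
    }
}
```

So for the input "0001ff", array_length matches "00" and "01" and yields 1, meaning exactly one data byte must follow.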

will parse the following input:

 0001ff0002abcd 

to the following AST:

(image: the resulting AST)
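For reference, the structure that AST captures can be reproduced by decoding the input by hand. The following plain-Java sketch (my own helper, not part of the answer's code) splits the input into its two length-prefixed arrays:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: manually decode a sequence of flexible byte arrays, mirroring the
// ROOT -> ARRAY -> (LENGTH, DATA) shape of the AST.
public class Decode {

    // Returns one "length:data" entry per array found in the input.
    static List<String> decode(String input) {
        List<String> arrays = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // LENGTH: two hex bytes (four characters), first byte is the high byte
            int length = Integer.parseInt(input.substring(pos, pos + 4), 16);
            pos += 4;
            // DATA: `length` bytes, two hex characters each
            String data = input.substring(pos, pos + 2 * length);
            pos += 2 * length;
            arrays.add(length + ":" + data);
        }
        return arrays;
    }

    public static void main(String[] args) {
        System.out.println(decode("0001ff0002abcd")); // [1:ff, 2:abcd]
    }
}
```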

as you can see using the following main class:

    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;
    import org.antlr.stringtemplate.*;

    public class Main {
        public static void main(String[] args) throws Exception {
            FlexByteArray_HexGrammarLexer lexer = new FlexByteArray_HexGrammarLexer(new ANTLRStringStream("0001ff0002abcd"));
            FlexByteArray_HexGrammarParser parser = new FlexByteArray_HexGrammarParser(new CommonTokenStream(lexer));
            CommonTree tree = (CommonTree)parser.expr().getTree();
            DOTTreeGenerator gen = new DOTTreeGenerator();
            StringTemplate st = gen.toDOT(tree);
            System.out.println(st);
        }
    }

More details

EDIT

To give a brief description of the array_data rule:

    array_data [int length]
      : ({length > 0}?=> hex_byte {length--;})* {length == 0}? -> ^(DATA hex_byte*)
      ;

As you mentioned in the comments, you can pass one or more parameters to a rule by adding [TYPE IDENTIFIER] after the rule name.

The first (gated) semantic predicate, {length > 0}?=>, checks whether length is greater than zero. If so, the parser tries to match a hex_byte, after which the length variable is decremented by one. This stops either when length reaches zero or when the parser can no longer match a hex_byte, which happens when EOF is next in the stream. Since the loop might match fewer than the required number of hex_bytes, there is a (validating) semantic predicate, {length == 0}?, at the very end of the rule, which ensures that exactly the right number of hex_bytes has been parsed (no more, no less!).
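The loop-and-check behaviour described above can be mimicked in plain Java (an illustrative sketch of the control flow, not the ANTLR-generated parser code):

```java
import java.util.List;

// Sketch: the control flow that array_data's predicates implement, applied
// to a pre-split list of hex-byte strings.
public class ArrayDataCheck {

    // Returns true iff exactly `length` bytes can be consumed starting at `pos`.
    static boolean matchArrayData(List<String> bytes, int pos, int length) {
        // ({length > 0}?=> hex_byte {length--;})*  -- consume while the gate holds
        // and there is still a hex_byte to match
        while (length > 0 && pos < bytes.size()) {
            pos++;
            length--;
        }
        // {length == 0}?  -- validating predicate: exactly enough bytes were seen
        return length == 0;
    }

    public static void main(String[] args) {
        System.out.println(matchArrayData(List.of("ff"), 0, 1)); // true
        System.out.println(matchArrayData(List.of("ff"), 0, 2)); // false: too few bytes
    }
}
```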

Hope this clarifies a bit.


Source: https://habr.com/ru/post/886839/

