Triangular Grammar Treetop

I had some ideas for a new programming language floating around in my head, so I thought I'd try to implement it. A friend suggested I try using Treetop (Ruby Stone) to create a parser. The Treetop documentation is sparse, and I've never done it before.

My parser acts as if it has an infinite loop, but without stack traces; it’s hard to track. Can someone point me in the direction of entry-level guidance / AST? I really need something that list rules, normal use, etc. To use tools such as Treetop. My parser grammar is on GitHub if someone wants to help me improve it.

class { initialize = lambda (name) { receiver.name = name } greet = lambda { IO.puts("Hello, #{receiver.name}!") } }.new(:World).greet() 
+5
source share
2 answers

I asked treetop to compile your language into a .rb file. This gave me something to delve into:

 $ tt -o /tmp/rip.rb /tmp/rip.treetop 

Then I used this little stub to recreate the loop:

 require 'treetop' load '/tmp/rip.rb' RipParser.new.parse('') 

It freezes. Now, this is not so interesting! An empty line reproduces the behavior in the same way as an example of a dozen or a line in your question.

To find out where it hangs, I used the Emacs keyboard macro to edit rip.rb, adding a debug statement to each method entry. For instance:

 def _nt_root p [__LINE__, '_nt_root'] #DEBUG start_index = index 

Now we can see the scope of the loop:

 [16, "root"] [21, "_nt_root"] [57, "_nt_statement"] ... [3293, "_nt_eol"] [3335, "_nt_semicolon"] [3204, "_nt_comment"] [57, "_nt_statement"] [57, "_nt_statement"] [57, "_nt_statement"] ... 

Further debugging from there shows that the integer is allowed to be an empty string:

 rule integer digit* end 

This indirectly allows us to say that this is an empty string, and the top-level rule statement* for the eternal use of empty operators. Changing * to + fixes the loop, but detects another problem:

 /tmp/rip.rb:777:in `_nt_object': stack level too deep (SystemStackError) from /tmp/rip.rb:757:in `_nt_compound_object' from /tmp/rip.rb:1726:in `_nt_range' from /tmp/rip.rb:1671:in `_nt_special_literals' from /tmp/rip.rb:825:in `_nt_literal_object' from /tmp/rip.rb:787:in `_nt_object' from /tmp/rip.rb:757:in `_nt_compound_object' from /tmp/rip.rb:1726:in `_nt_range' from /tmp/rip.rb:1671:in `_nt_special_literals' ... 3283 levels... 

The left-recursive range is indirectly through special_literals, literal_object, object, and complex_object. A tritop, faced with left recursion, eats a stack until it kills. I do not have a quick fix for this problem, but at least you have a stack trace to go from it.

In addition, this is not your immediate problem, but the definition of digit is odd: it can be either one digit or several. This calls digit* or digit+ to resolve the (supposedly) integer 1________2 .

+7
source

I really liked Parr's Language Implementation Templates ; since Parr created the ANTLR parser generator , it is a tool that he uses throughout the book, but it should be simple enough to learn the same thing from it.

I really liked that each example grew on the previous one; he does not start with a gigantic AST-compatible parsing; instead, he slowly introduces problems requiring more and more β€œbackend smarts” to do the job, so the book scales well with a language that needs parsing.

What I want it to be covered in a slightly greater degree is the types of languages ​​in which you can write and give advice on Do's and Do not Do's when developing languages. I saw some languages ​​that are a huge pain for parsing, and I would like to know more about design decisions that could be made differently.

+1
source

Source: https://habr.com/ru/post/888973/


All Articles