Incomplete grammar parsing

Are there any general approaches to working with an incomplete grammar? In my case, I just want to detect methods in Delphi (Pascal) files, that is, procedures and functions. The following first attempt works:

  methods : ( procedure | function | . )+ ; 

but is it a proper solution at all? Are there better ones? Is it possible to stop parsing from an action (for example, after detecting implementation)? Does it make sense to use a preprocessor? And if so, how?

+4
2 answers

If you are only looking for the names, then something as simple as:

  grammar PascalFuncProc;

  parse      : (Procedure | Function)* EOF ;

  Procedure  : 'procedure' Spaces Identifier ;
  Function   : 'function' Spaces Identifier ;
  Ignore     : (StrLiteral | Comment | .) {skip();} ;

  fragment Spaces     : (' ' | '\t' | '\r' | '\n')+ ;
  fragment Identifier : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')* ;
  fragment StrLiteral : '\'' ~'\''* '\'' ;
  fragment Comment    : '{' ~'}'* '}' ;

will do the trick. Please note that I am not very familiar with Delphi/Pascal, so I may well have gotten StrLiteral and/or Comment wrong, but that is easily fixed.

The lexer generated from the above grammar produces only two types of tokens (Procedure and Function). The rest of the input (string literals, comments, or, if neither of those matches, a single character: the . in the Ignore rule) is immediately discarded by the lexer (the skip() method).

For the input:

  some valid source
  {
    function NotAFunction ...
  }

  procedure Proc
  Begin
    ...
  End;

  procedure Func
  Begin
    s = 'function NotAFunction!!!'
  End;

The following parse tree is created:

[image: parse tree]
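The same "keep two token types, skip everything else" idea can be sketched outside ANTLR. The following is a hand-written Python approximation of the lexer rules above, not code generated by ANTLR; the names TOKEN, find_methods and SAMPLE are made up for this illustration, and like the grammar it matches the keywords case-sensitively.

```python
import re

# Alternatives are tried left to right, mirroring the lexer rules of the grammar:
# Procedure, Function, then the Ignore rule (StrLiteral | Comment | any character).
TOKEN = re.compile(
    r"""
      (?P<Procedure>procedure[ \t\r\n]+[A-Za-z_][A-Za-z_0-9]*)
    | (?P<Function>function[ \t\r\n]+[A-Za-z_][A-Za-z_0-9]*)
    | (?P<StrLiteral>'[^']*')
    | (?P<Comment>\{[^}]*\})
    | (?P<Other>.)
    """,
    re.VERBOSE | re.DOTALL,
)


def find_methods(source):
    """Return (kind, name) pairs; string literals, comments and stray characters are skipped."""
    found = []
    for match in TOKEN.finditer(source):
        kind = match.lastgroup
        if kind in ("Procedure", "Function"):
            # The matched text is "<keyword><spaces><identifier>"; keep the identifier.
            found.append((kind, match.group().split()[-1]))
    return found


SAMPLE = """some valid source
{
  function NotAFunction ...
}

procedure Proc
Begin
  ...
End;

procedure Func
Begin
  s = 'function NotAFunction!!!'
End;
"""

print(find_methods(SAMPLE))
```

For the sample input, only Proc and Func are reported; the two NotAFunction occurrences sit inside a comment and a string literal and are consumed by the skip alternatives.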

+4

What you are asking about is called island grammars. The idea is that you define a parser for the part of the language you care about (the "island"), with all the classic tokenization needed for that part, and an extremely sloppy parser that skips the rest (the "ocean" in which the island is embedded). One common trick for doing this is to define correspondingly sloppy lexers that pick up vast amounts of material at once (to skip past HTML to the embedded code, for example, the lexer can skip anything that does not look like a script tag).
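As a rough illustration of the HTML case, here is a minimal Python sketch of such a sloppy "ocean" rule. It is an assumption-laden toy rather than a real HTML lexer: SCRIPT_ISLAND and script_islands are hypothetical names, and the regular expression simply treats everything outside a script element as ocean.

```python
import re

# Island: the text between <script ...> and </script>.
# Ocean: everything else, which finditer passes over without looking at it.
SCRIPT_ISLAND = re.compile(
    r"<script\b[^>]*>(.*?)</script\s*>",
    re.DOTALL | re.IGNORECASE,
)


def script_islands(html):
    """Return the bodies of all script elements, ignoring the surrounding markup."""
    return [match.group(1) for match in SCRIPT_ISLAND.finditer(html)]


PAGE = (
    "<html><body><p>text</p>"
    "<script type='text/javascript'>var x = 1;</script>"
    "<p>more</p><script>f();</script></body></html>"
)

print(script_islands(PAGE))
```

The sloppiness cuts both ways: this sketch would also report a script element that sits inside an HTML comment, because the ocean rule knows nothing about comments.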

The ANTLR site even discusses some of the issues involved; in particular, it refers to the examples included with ANTLR. I have no experience with ANTLR, so I don't know how useful this specific information is.

Having built many tools that use parsers to analyze or transform code (check my bio), I'm a bit pessimistic about the general usefulness of island grammars. Unless your goal is to do something pretty trivial with the parsed island, you will need to collect the meaning of every identifier it uses, directly or indirectly ... and most of those, unfortunately for you, are defined in the ocean. So IMHO you pretty much have to parse the ocean too to get past trivial tasks. You will also have trouble making sure you really skip what should be skipped; this pretty much means that your ocean lexer has to know about whitespace, comments, and all the picky syntax of character strings (this is harder than it looks with modern languages) so that they are skipped correctly. YMMV.
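The point about comments and strings can be made concrete for Delphi, which has three comment forms ({ ... }, (* ... *) and // to end of line) and writes a quote inside a string literal as ''. The Python sketch below compares a naive keyword search with a scanner that consumes comments and strings first. NAIVE, AWARE, methods_naive, methods_aware and the Ghost names are invented for this illustration, and compiler directives, nested comments and the rest of the language are not handled.

```python
import re

# Naive search: any "procedure X" or "function X", wherever it appears.
NAIVE = re.compile(r"\b(procedure|function)\s+(\w+)", re.IGNORECASE)

# Ocean-aware search: comments and string literals are matched (and so consumed)
# before the keyword alternative gets a chance to match inside them.
AWARE = re.compile(
    r"""
      \{.*?\}                 # brace comment
    | \(\*.*?\*\)             # parenthesis-star comment
    | //[^\n]*                # line comment
    | '(?:[^'\n]|'')*'        # string literal with doubled-quote escapes
    | \b(?P<kind>procedure|function)\s+(?P<name>[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)


def methods_naive(source):
    return [match.group(2) for match in NAIVE.finditer(source)]


def methods_aware(source):
    # The "name" group is None whenever a comment or string alternative matched.
    return [m.group("name") for m in AWARE.finditer(source) if m.group("name")]


SAMPLE = """\
unit Demo;
// don't treat function Ghost1 as code
(* procedure Ghost2 *)
{ function Ghost3 }
procedure TForm1.Run;
begin
  s := 'it''s not a function Ghost4';
end;
"""

print(methods_naive(SAMPLE))
print(methods_aware(SAMPLE))
```

The naive search reports all four Ghost names plus a truncated TForm1, while the aware scanner reports only TForm1.Run. Note that the apostrophe in the // comment would open a bogus string literal in any lexer that does not know about line comments.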

+3

Source: https://habr.com/ru/post/1369138/
