ANTLR4 Lexer Matching Line Start End Line

Question


How can I get the equivalent of Perl's ^ and $ regex anchors in an ANTLR4 grammar? That is, how do I match the beginning and the end of a line without consuming any characters?

I am trying to use the ANTLR4 lexer to match a # character at the beginning of a line, but not in the middle of a line. For example, I want to isolate and discard all C++ preprocessor directives, regardless of which directive it is, while ignoring any # inside a string literal. (Normally we would tokenize C++ string literals to exclude a # that appears inside a string, but assume we don't.) This means I only want to specify # .*? without bothering with #if, #ifndef, #pragma, etc.

In addition, the C++ standard allows whitespace and multi-line comments before and after the #, for example:

/* helo world*/ # /* hel l o */ /*world */ifdef .....

is considered a valid preprocessor directive appearing on a single line. (Any CRLFs inside the ML_COMMENTs are discarded.)

This is what I am doing now:

 PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR);

But the problem is that I have to rely on the existence of a CRLF before the # and discard that CRLF together with the directive. I need the CRLF that ends the directive line to stand in for the discarded one, so I have to make sure the directive is terminated by a CRLF.

However, this means that my grammar cannot handle a directive that appears right at the beginning of the file (i.e. with no preceding CRLF) or one that is followed by EOF with no terminating CRLF.

If Perl-style regex ^ and $ syntax were available, I could match SOL/EOL instead of explicitly matching and consuming the CRLF.

+4

regex antlr4

Javaman May 05 '13 at 8:03


2 answers

You can try having several rules guarded by gated semantic predicates (see "Different lexer rules in different states") or using modes (pushMode → http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules ), with an alternative rule for the start of the file that switches to the main rules once the directives end, but it could be a long job.

First of all, I would check whether there really is a problem parsing the pragma / preprocessor directives without changing anything. For example, if the issue is that a # can also appear inside strings and comments, then simply ordering the rules should let you direct it to the right rule (although this could be a problem for languages where directives can be placed inside comments).

+1

lunadir May 05 '13 at 10:47


Sam Harwell · Accepted Answer · 2013-05-05T17:37:35+0000

You can use semantic predicates for conditions.

 PPLINE : {getCharPositionInLine() == 0}? (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ {_input.LA(1) == '\r' || _input.LA(1) == '\n'}? -> channel(PPDIR) ;
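The rule above still requires a '\r' or '\n' after the directive, so a directive that runs to EOF without a trailing newline would not match. A minimal sketch of one possible variant follows. It assumes the Java target, a hypothetical grammar name PPSketch, an ML_COMMENT defined as shown, and a `channels { }` block (available in ANTLR 4.5 and later) to declare PPDIR. The extra `IntStream.EOF` check is an addition for illustration, not part of the answer above, and the whole sketch has not been verified against a particular ANTLR release.

```antlr
lexer grammar PPSketch;   // hypothetical grammar name

channels { PPDIR }        // assumes ANTLR 4.5+; older versions need a constant in @members

// A directive must start at column 0 (the "^" condition) and must be
// followed by a line break or by end of input (the "$" condition).
// Neither predicate consumes any characters.
PPLINE
    : {getCharPositionInLine() == 0}?
      (ML_COMMENT | '\t' | '\f' | ' ')* '#' (ML_COMMENT | ~[\r\n])+
      {_input.LA(1) == '\r' || _input.LA(1) == '\n' || _input.LA(1) == IntStream.EOF}?
      -> channel(PPDIR)
    ;

// Assumed definitions so that the sketch is a complete grammar.
ML_COMMENT : '/*' .*? '*/' -> channel(HIDDEN) ;
NL         : '\r'? '\n'    -> channel(HIDDEN) ;
WS         : [ \t\f]+      -> channel(HIDDEN) ;
OTHER      : . ;           // placeholder for the real C++ token rules
```

Because the line break is checked with a predicate rather than matched, it stays in the input and is tokenized separately by NL, so the directive token carries no CRLF. The start of the file is covered as well, since the character position in the line is already 0 there.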
