JFlex matches nested comments as a single token

Question

In Mathematica, a comment begins with (* and ends with *), and comments can be nested. My current approach to scanning comments with JFlex uses the following code:

    %xstate IN_COMMENT

    "(*" { yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}

    <IN_COMMENT> {
      "(*"        {yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
      [^\*\)\(]*  {return MathematicaElementTypes.COMMENT;}
      "*)"        {yypopstate(); return MathematicaElementTypes.COMMENT;}
      [\*\)\(]    {return MathematicaElementTypes.COMMENT;}
      .           {return MathematicaElementTypes.BAD_CHARACTER;}
    }

where yypushstate and yypopstate are defined as

    private final LinkedList<Integer> states = new LinkedList<>();

    private void yypushstate(int state) {
        states.addFirst(yystate());
        yybegin(state);
    }

    private void yypopstate() {
        final int state = states.removeFirst();
        yybegin(state);
    }

so that I can track how many levels of nested comments I am currently in.

Unfortunately, this produces several COMMENT tokens for a single comment, because I have to match every nested comment start and every comment end.

Question: Is it possible to use the JFlex API, with methods like yypushback or advance(), to return exactly one token spanning the entire comment, even when comments are nested?

+6

comments lex jflex grammar

halirutan Jul 10 '14 at 2:44

3 answers

When I first wrote this answer, I did not even realize that one of the existing answers was by the questioner himself. On the other hand, I rarely see a bounty in the rather small SO lex community. So it seemed worth it to me to learn enough Java and JFlex to write a sample:

    /* JFlex scanner: to recognize nested comments in Mathematica style */

    %%

    %{
      /* counter for open (nested) comments */
      int open = 0;
    %}

    %state IN_COMMENT

    %%

    /* any state */

    "(*" { if (open++ == 0) yybegin(IN_COMMENT); }

    "*)" {
      if (open > 0) {
        if (--open == 0) {
          yybegin(YYINITIAL);
          return MathematicaElementTypes.COMMENT;
        }
      } else {
        /* or return MathematicaElementTypes.BAD_CHARACTER;
         * or: throw new Error("'*)' without '(*'!"); */
      }
    }

    <IN_COMMENT> {
      . |
      \n { }
    }

    <<EOF>> {
      if (open > 0) {
        /* Remember the count, then reset. The reset is obsolete if the
         * scanner is instanced new for each invocation. */
        final int unclosed = open;
        open = 0; yybegin(YYINITIAL);
        /* Notify about syntax error, e.g. */
        throw new Error("Premature end of file! ("
          + unclosed + " open comments not closed.)");
      }
      return MathematicaElementTypes.EOF; /* just a guess */
    }

There may be typos and silly mistakes in it, although I tried to be careful.

As a “proof of concept”, I leave my initial implementation here, which is written with flex and C/C++.

This scanner prints each complete comment (using printf()) and echoes everything else.

My solution is based on the fact that flex rule actions can end with break or return. Therefore, no token is returned until the rule whose pattern matches the closing of the outermost comment fires. The contents of the comment are simply collected in a buffer, in my case a std::string. (AFAIK, String is even a built-in type in Java. That is why I decided to mix C and C++, which I usually would not do.)
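The same idea can be sketched in plain Java without any generator. This is only an illustration under assumptions: the class name NestedCommentCollector and the method collect are made up for this example and are not part of JFlex or flex. A counter tracks the open comments, a StringBuilder plays the role of the buffer, and a comment is emitted only when the counter drops back to zero.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not JFlex API): returns each outermost (* ... *) comment as one string. */
final class NestedCommentCollector {

    static List<String> collect(String input) {
        List<String> comments = new ArrayList<>();
        StringBuilder buffer = new StringBuilder(); // text of the comment being collected
        int open = 0;                               // number of currently open comments
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("(*", i)) {
                open++;
                buffer.append("(*");
                i += 2;
            } else if (input.startsWith("*)", i)) {
                if (open == 0) {
                    throw new IllegalStateException("'*)' without '(*' at offset " + i);
                }
                buffer.append("*)");
                i += 2;
                if (--open == 0) {                  // outermost comment closed: emit one token
                    comments.add(buffer.toString());
                    buffer.setLength(0);
                }
            } else {
                if (open > 0) {
                    buffer.append(input.charAt(i)); // inside a comment: collect, emit nothing
                }
                i++;                                // outside a comment: skip the character
            }
        }
        if (open > 0) {
            throw new IllegalStateException(
                "Premature end of input: " + open + " open comments not closed");
        }
        return comments;
    }
}
```

Calling NestedCommentCollector.collect("x (* a (* nested *) comment *) y") yields a list with the single element "(* a (* nested *) comment *)". Throwing on unbalanced input is a choice made for this sketch; a real lexer would more likely return an error token.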

My source scan-nested-comments.l:

    %{
    #include <cstdio>
    #include <string>

    // counter for open (nested) comments
    static int open = 0;
    // buffer for collected comments
    static std::string comment;
    %}

    /* make never interactive (prevent usage of certain C functions) */
    %option never-interactive
    /* force lexer to process 8 bit ASCIIs (unsigned characters) */
    %option 8bit
    /* prevent usage of yywrap */
    %option noyywrap

    %s IN_COMMENT

    %%

    "(*" {
      if (!open++) BEGIN(IN_COMMENT);
      comment += "(*";
    }

    "*)" {
      if (open) {
        comment += "*)";
        if (!--open) {
          BEGIN(INITIAL);
          printf("EMIT TOKEN COMMENT(lexem: '%s')\n", comment.c_str());
          comment.clear();
        }
      } else {
        printf("ERROR: '*)' without '(*'!\n");
      }
    }

    <IN_COMMENT>{
      . |
      "\n" { comment += *yytext; }
    }

    <<EOF>> {
      if (open) {
        printf("ERROR: Premature end of file!\n"
          "(%d open comments not closed.)\n", open);
        return 1;
      }
      return 0;
    }

    %%

    int main(int argc, char **argv)
    {
      if (argc > 1) {
        yyin = fopen(argv[1], "r");
        if (!yyin) {
          printf("Cannot open file '%s'!\n", argv[1]);
          return 1;
        }
      } else yyin = stdin;
      return yylex();
    }

I compiled it with flex and g++ in cygwin on Windows 10 (64 bit):

    $ flex -oscan-nested-comments.cc scan-nested-comments.l ; g++ -o scan-nested-comments scan-nested-comments.cc
    scan-nested-comments.cc:398:0: warning: "yywrap" redefined
     ^
    scan-nested-comments.cc:74:0: note: this is the location of the previous definition
     ^

    $

The warning appears because of %option noyywrap. I assume it is harmless and can be ignored.

Then I ran some tests:

    $ cat >good-text.txt <<EOF
    > Test for nested comments.
    > (* a comment *)
    > (* a (* nested *) comment *)
    > No comment.
    > (* a
    > (* nested
    > (* multiline *)
    > *)
    > comment *)
    > End of file.
    > EOF

    $ cat good-text.txt | ./scan-nested-comments
    Test for nested comments.
    EMIT TOKEN COMMENT(lexem: '(* a comment *)')
    EMIT TOKEN COMMENT(lexem: '(* a (* nested *) comment *)')
    No comment.
    EMIT TOKEN COMMENT(lexem: '(* a
    (* nested
    (* multiline *)
    *)
    comment *)')
    End of file.

    $ cat >bad-text-1.txt <<EOF
    > Test for wrong comment.
    > (* a comment *)
    > with wrong nesting *)
    > End of file.
    > EOF

    $ cat bad-text-1.txt | ./scan-nested-comments
    Test for wrong comment.
    EMIT TOKEN COMMENT(lexem: '(* a comment *)')
    with wrong nesting ERROR: '*)' without '(*'!
    End of file.

    $ cat >bad-text-2.txt <<EOF
    > Test for wrong comment.
    > (* a comment
    > which is not closed.
    > End of file.
    > EOF

    $ cat bad-text-2.txt | ./scan-nested-comments
    Test for wrong comment.
    ERROR: Premature end of file!
    (1 open comments not closed.)

    $

+2

Scheff May 17 '17 at 12:52

A traditional Java comment is defined in a sample grammar using

 TraditionalComment = "/*" [^*] ~"*/" | "/*" "*"+ "/"

I believe this expression should work for Mathematica comments as well.
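One thing worth checking before relying on this for nested input: the ~ (upto) operator in JFlex stops at the first occurrence of the closing delimiter. The following is a rough java.util.regex analogue with (* and *) substituted for the delimiters. It is an assumption made for illustration, not the JFlex expression itself, and the names FlatCommentPattern and firstMatch are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: a flat (non-nesting) comment pattern written with java.util.regex. */
final class FlatCommentPattern {

    // "(*", then as little as possible, then the first "*)".
    static final Pattern FLAT = Pattern.compile("\\(\\*.*?\\*\\)", Pattern.DOTALL);

    /** Returns the first match of the flat pattern, or null if there is none. */
    static String firstMatch(String input) {
        Matcher m = FLAT.matcher(input);
        return m.find() ? m.group() : null;
    }
}
```

For the input "(* a (* nested *) comment *)" the match is "(* a (* nested *)", so a pattern of this shape ends at the inner closing delimiter and does not cover the whole nested comment.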

0

rds Sep 16 '14 at 18:57

halirutan · Accepted Answer · 2017-05-13T05:00:11+0000

It seems the bounty was uncalled for, because the solution is so simple that I just did not consider it. Let me explain. When scanning a simple nested comment

 (* (*..*) *)

I need to keep track of how many opening comment tokens I have seen, so that at the last real closing comment I can return the whole comment as one token.

What I did not realize is that JFlex does not need to be told to advance to the next portion of input when it matches something. After careful review, I saw that this is explained here, but somewhat hidden in a section that I had not paid attention to:

Since we have not returned a value to the parser, our scanner proceeds immediately.

Hence a rule in the flex file like this

 [^\(\*\)]+ { }

reads all characters except those that may be the start or end of a comment and does nothing with them, but it still advances to the next token.
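To make that behavior concrete, here is a toy scanner loop in plain Java. It is a sketch of the principle only, not the code JFlex generates, and the names SkippingScanner and skipAction are invented for the example. An action that returns null keeps the loop running, and only a returned value leaves advance().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy scanner loop (not JFlex-generated code): an action that returns null just continues. */
final class SkippingScanner {
    // java.util.regex analogue of the rule [^\(\*\)]+
    private static final Pattern SKIP = Pattern.compile("[^()*]+");

    private final String input;
    private int pos = 0;

    SkippingScanner(String input) {
        this.input = input;
    }

    /** Action for SKIP: the text is already consumed and no token is returned. */
    private String skipAction() {
        return null;
    }

    /** Returns the next token, or null at end of input. */
    String advance() {
        while (pos < input.length()) {
            Matcher m = SKIP.matcher(input).region(pos, input.length());
            String token;
            if (m.lookingAt()) {
                pos = m.end();
                token = skipAction();                         // empty action: nothing returned
            } else {
                token = String.valueOf(input.charAt(pos++));  // any other character is a token
            }
            if (token != null) {
                return token;                                 // only a returned value leaves the loop
            }
        }
        return null;
    }
}
```

With the input "abc(def*", successive calls to advance() return "(", then "*", then null. The runs "abc" and "def" are consumed without ever surfacing as tokens.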

This means that I can just do the following. In the YYINITIAL state, I have a rule that matches the initial comment, but it does nothing, and then switches the lexer to the IN_COMMENT state. In particular, it does not return any token:

 {CommentStart} { yypushstate(IN_COMMENT);}

Now we are in the IN_COMMENT state, and there I do the same. I eat all the characters but never return a token. When I find another comment opening, I push the state onto the stack but otherwise do nothing. Only when I hit the last comment closing do I know that I am leaving the IN_COMMENT state, and this is the only point where I finally return the token. Let's look at the rules:

    <IN_COMMENT> {
      {CommentStart}  { yypushstate(IN_COMMENT);}
      [^\(\*\)]+      { }
      {CommentEnd}    { yypopstate(); if(yystate() != IN_COMMENT) return MathematicaElementTypes.COMMENT_CONTENT; }
      [\*\)\(]        { }
      .               { return MathematicaElementTypes.BAD_CHARACTER; }
    }

That's it. Now, no matter how deeply your comments are nested, you always get one single token containing the entire comment.
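For readers who want to step through the logic outside JFlex, here is a hand-written Java sketch of the same rules. The names CommentSpan and endOfComment are hypothetical, and the state constants only mimic the JFlex ones. Treating an unterminated comment as running to the end of the input is an assumption of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hand-written sketch of the rules above; names are illustrative, not JFlex or IDEA API. */
final class CommentSpan {
    static final int YYINITIAL = 0;
    static final int IN_COMMENT = 1;

    /**
     * Given the offset of a "(*", returns the end offset (exclusive) of the single
     * comment token. An unterminated comment extends to the end of the input.
     */
    static int endOfComment(String text, int start) {
        if (!text.startsWith("(*", start)) {
            throw new IllegalArgumentException("no comment start at offset " + start);
        }
        Deque<Integer> states = new ArrayDeque<>(); // saved states, as in yypushstate
        int state = YYINITIAL;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("(*", i)) {
                states.push(state);                 // yypushstate(IN_COMMENT)
                state = IN_COMMENT;
                i += 2;
            } else if (text.startsWith("*)", i)) {
                state = states.pop();               // yypopstate()
                i += 2;
                if (state != IN_COMMENT) {
                    return i;                       // outermost comment closed: one token
                }
            } else {
                i++;                                // comment content: consume, return nothing
            }
        }
        states.clear();                             // end of input inside a comment
        return text.length();
    }
}
```

For text = "a (* x (* y *) z *) b", the call text.substring(2, CommentSpan.endOfComment(text, 2)) gives "(* x (* y *) z *)", that is, one span for the whole nested comment.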

Now I'm embarrassed, and I'm sorry for such a simple question.

Final note

If you do something like this, you should remember that you only return the token when you hit the correct closing "symbol". Therefore, you should definitely add a rule that catches the end of the file. In IDEA, the default behavior in that case is to just return the comment token, so you need one more line (useful or not, I want to end gracefully):

  <<EOF>> { yyclearstack(); yybegin(YYINITIAL); return MathematicaElementTypes.COMMENT;}
