Lexer / Parser ambiguity

How does a lexer solve this ambiguity?

/*/*/ 

That is, how does it know not to say: oh yes, that's the beginning of a multi-line comment, followed by another multi-line comment?

Wouldn't a greedy lexer just return the following tokens?

  • /*
  • /*
  • /

I'm in the midst of writing a shift-reduce parser for CSS, and yet this simple comment thing is in my way. You can read this question if you want more details.

UPDATE

Sorry for leaving this out in the first place. I plan to add extensions to the CSS language in the form /* @ func ( args, ... ) */, but I do not want to confuse an editor that understands CSS but not these extension comments of mine. Therefore, the lexer cannot simply ignore comments.

+4
6 answers

One way to do this is to have the lexer enter a different internal state when it encounters the first /*. For example, flex calls these "start conditions" (matching C-style comments is one of the examples on that page).
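The state-switching idea can be sketched by hand without flex. The following is a minimal illustration in Python, not flex's own mechanism: the names State, tokenize, TEXT, COMMENT and UNTERMINATED_COMMENT are hypothetical, and everything outside a comment is lumped into a single TEXT token for brevity.

```python
from enum import Enum, auto


class State(Enum):
    INITIAL = auto()   # ordinary input, outside any comment
    COMMENT = auto()   # between an opening /* and its closing */


def tokenize(text):
    """Split text into ("TEXT", s) and ("COMMENT", s) tokens."""
    tokens = []
    state = State.INITIAL
    start = 0          # index where the current token began
    i = 0
    while i < len(text):
        if state is State.INITIAL and text.startswith("/*", i):
            if i > start:
                tokens.append(("TEXT", text[start:i]))
            state = State.COMMENT
            start = i
            i += 2     # consume both opener characters at once
        elif state is State.COMMENT and text.startswith("*/", i):
            i += 2
            tokens.append(("COMMENT", text[start:i]))
            state = State.INITIAL
            start = i
        else:
            # Inside a comment, a further "/*" is just content.
            i += 1
    if i > start:
        kind = "TEXT" if state is State.INITIAL else "UNTERMINATED_COMMENT"
        tokens.append((kind, text[start:i]))
    return tokens
```

Because the opener is consumed as a unit before the state changes, its asterisk can never double as the start of a closer, so tokenize("/*/*/") yields one COMMENT token covering all five characters.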

+8

The simplest way is probably to lex the comment as one single token; that is, do not emit a START COMMENT token, but instead keep reading the input until you can emit a COMMENT BLOCK token that includes the entire /*(anything)*/ bit.

Since comments are not relevant to the actual parsing of the executable code, it is fine for the lexer to exclude them (or at least to group each one into a single token). You don't care about token matches inside a comment.
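A rough sketch of the single-token approach, assuming a scanner that looks ahead for the closer instead of tokenizing the comment body. The names read_comment, lex, COMMENT_BLOCK and OTHER are made up for this example, and it ignores the fact that a real CSS lexer must not treat /* inside a quoted string as a comment opener.

```python
def read_comment(text, pos):
    """Return the index just past the comment that starts at pos.

    Assumes text[pos:pos + 2] == "/*". The search for the closer begins
    after the opener, so the opener's '*' is never reused by the closer.
    """
    end = text.find("*/", pos + 2)
    if end == -1:
        raise ValueError(f"unterminated comment at offset {pos}")
    return end + 2


def lex(text):
    """Yield (kind, lexeme) pairs; each comment is one COMMENT_BLOCK."""
    pos = 0
    while pos < len(text):
        if text.startswith("/*", pos):
            end = read_comment(text, pos)
            yield ("COMMENT_BLOCK", text[pos:end])
        else:
            # Everything up to the next opener is passed through untouched.
            nxt = text.find("/*", pos)
            end = len(text) if nxt == -1 else nxt
            yield ("OTHER", text[pos:end])
        pos = end
```

A parser that does not care about comments can filter them out with something like `[t for t in lex(src) if t[0] != "COMMENT_BLOCK"]`, while one that needs them (for example to inspect their contents) still receives each comment as a single unit.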

+6

In most languages, this is not ambiguous: the first slash and asterisk are consumed to produce the "start of multi-line comment" token. It is followed by a slash, which is "content" inside the comment, and finally the last two characters form the "end of multi-line comment" token.

Since the first two characters are consumed, the first asterisk cannot also be used to produce the "end of comment" token. I just noticed that the following slash and asterisk could produce a second "start of comment" token ... oops, that could be a problem, depending on how much context is available to the parser.

I am talking about tokens here, which assumes the comment is handled at the parser level. But the same applies to a lexer, where the basic rule is to start at '/*' and not stop until '*/' is found. In effect, lexer-level handling of the entire comment will not be confused by the second "start of comment".

+3

Use the same approach a regex would: when you reach a closing */, search back from the current location toward the beginning of the line for the opening /*.

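 // Fires when chars[currentLocation] is the '/' of a closing "*/".
 // The shortest comment is "/**/", so the closer's '/' is at index 3 or later.
 if (currentLocation >= 3 && chars[currentLocation] == '/' && chars[currentLocation - 1] == '*') {
     // Search backward for the opener. Start at currentLocation - 3 so the
     // opener's '*' is not the same character as the closer's '*'.
     for (int i = currentLocation - 3; i >= 0; i--) {
         if (chars[i] == '/' && chars[i + 1] == '*') {
             // .......
         }
     }
 }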

This is similar to applying the regexp /\*([^\*]|\*[^\/])*\*/ greedily, working backward from the end of the comment.
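As an alternative to a hand-written backward search, a forward-matching pattern can do the same job. The sketch below uses the comment pattern from the CSS 2.1 tokenizer grammar rather than the pattern quoted above; the helper name match_comment is hypothetical.

```python
import re

# Comment pattern as given in the CSS 2.1 grammar, written with a
# non-capturing group. The opener "/*" is matched first, so its
# asterisk is no longer available to the closing "*/".
COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")


def match_comment(text, pos=0):
    """Return the comment lexeme starting at pos, or None if there is none."""
    m = COMMENT.match(text, pos)
    return m.group(0) if m else None


print(match_comment("/*/*/"))              # the whole input is one comment
print(match_comment("/*/"))                # None: opener without a closer
print(match_comment("/* a */ b /* c */"))  # stops at the first closer
```

The pattern also copes with runs of asterisks such as /***/, which simpler alternations tend to get wrong.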

0

Since CSS does not support nested comments, your example is normally lexed into a single token, COMMENT. That is, the lexer will see /* as a start-of-comment marker and will then consume everything up to and including the sequence */.

0

One way to solve this problem is to have your lexer return:

 / * / * / 

And let your parser handle it from there. This is what I would probably do for most programming languages, since / and * can also be used for multiplication and other similar things that are too complicated for the lexer to tell apart. A lexer should really just return elementary symbols.

If what counts as a token starts to depend too much on context, what you are looking for is probably a much simpler token.

However, CSS is not a programming language, so / and * are far less overloaded; in particular, the sequence /* is not used for anything other than comments. Therefore, I would be very tempted to simply pass the whole thing as a comment token, unless you have a good reason not to: /\*.*?\*/ (non-greedy, so that the match stops at the first */).
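Whether the quantifier in such a pattern is greedy matters once a stylesheet holds more than one comment. The short Python check below is only an illustration; the sample stylesheet and variable names are invented for it.

```python
import re

source = "/* one */ a { color: red } /* two */"

# Greedy: .* runs to the LAST "*/", swallowing the rule between the comments.
greedy = re.findall(r"/\*.*\*/", source, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST "*/" after each opener.
lazy = re.findall(r"/\*.*?\*/", source, flags=re.DOTALL)

print(greedy)  # one match spanning the entire string
print(lazy)    # two separate comments

# The input from the question is still a single comment either way.
tricky = re.findall(r"/\*.*?\*/", "/*/*/", flags=re.DOTALL)
```

re.DOTALL is set so that a comment may span several lines; without it, `.` does not match a newline in Python.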

0

Source: https://habr.com/ru/post/1306864/

