How do you write a lexer parser where identifiers can start with keywords?

Question

How do you write a lexer parser where identifiers can start with keywords?

Suppose you have a language in which identifiers can begin with keywords. For example, suppose case is a keyword, but caser is a valid identifier. Suppose also that lexer rules can only handle regular expressions. Then it seems that I cannot place the keyword rules before the identifier rule in lexer, because it will parse "caser" as a "case", followed by "r". I also cannot post keyword lexing rules after the identifier rule, because the identifier rule will match the keywords and the keyword rules will never run.

So, instead, I could create a keyword_or_identifier rule in lexer and analyze if the keyword_or_identifier is a keyword or an identifier. Is this what is usually done?

I understand that “using a different lexer with a look” is the answer (sort of), but I'm also interested in how this is done in a traditional DFA-based lexer, since my current lexer seems to work the way.

+4

keyword parsing dfa lexer

Benri May 03 '13 at 11:28

source share

1 answer

rici · Accepted Answer · 2013-05-05T23:04:54+0000

Most lexers, starting with the original lex , correspond to alternatives as follows:

Use the longest match.
If there are two or more alternatives that link for the longest match, use the first in the lexer definition.

This allows you to use the following style:

 "case" { return CASE; } [[:alpha:]][[:alnum:]]* { return ID; }

If the input pattern is caser , then the second option will be used because it is the longest match. If the input template is case r , then the first option will be used, because both of them correspond to case , and according to rule (2) the first wins above.

Although this may seem a little arbitrary, it is mostly consistent with the DFA approach. First of all, DFA does not stop when it first reaches the receiving state. If that were the case, patterns like [[:alpha:]][[:alnum:]]* would be useless because they enter the receiving state on the first character (provided its alphabet). Instead, DFA-based lexers continue until there are no possible transitions from the current state, and then they return to the last receiving state. (See below.)

This DFA state may be accepted due to two different rules, but this is not a problem either; only the first receiving rule is recorded.

To be fair, this is slightly different from the DFA mathematical model, which has a transition for each symbol in each state (although many of them may be transitions to the "sink" state) and which corresponds to the full one depending on whether the machine is in the receiving state when the last input character is read. The lexer model is slightly different, but it can be easily formalized.

The only difficulty of the theoretical model is "returning to the last receiving state." In practice, this is usually done by storing the state and input position each time a receiving state is reached. This means that you may need to rewind the input stream, possibly an arbitrary amount.

Most languages do not require backups very often, and very few require unlimited backups. Some lexer generators may generate faster code if there are no fallback states. ( flex will do this if you use -Cf or -Cf .)

One common case, which leads to undefined backups, does not provide an appropriate error return for string literals:

 ["][^"\n]*["] { return STRING; } /* ... */ . { return INVALID; }

Here, the first pattern will correspond to a string literal starting with " if there is a corresponding in the same string " . (I skipped \ elements for simplicity.) If the string literal is not destroyed, the last template will match, but the input should be rewound to " . In most cases, it makes no sense to continue lexical analysis, ignoring the unsurpassed " ; it would be wiser to just ignore the rest of the line. Thus, support is not only ineffective, but can also lead to an explosion of false error messages. A better solution might be:

 ["][^"\n]*["] { return STRING; } ["][^"\n]* { return INVALID_STRING; }

Here, the second alternative can be successful only if the line is not destroyed, because if the line is completed, the first alternative will correspond to another character. Therefore, it doesn’t even matter in which order these alternatives appear, although everyone I know would put them in the same order as me.

How do you write a lexer parser where identifiers can start with keywords?

More articles: