Most lexers, starting with the original lex , correspond to alternatives as follows:
This allows you to use the following style:
"case" { return CASE; } [[:alpha:]][[:alnum:]]* { return ID; }
If the input pattern is caser , then the second option will be used because it is the longest match. If the input template is case r , then the first option will be used, because both of them correspond to case , and according to rule (2) the first wins above.
Although this may seem a little arbitrary, it is mostly consistent with the DFA approach. First of all, DFA does not stop when it first reaches the receiving state. If that were the case, patterns like [[:alpha:]][[:alnum:]]* would be useless because they enter the receiving state on the first character (provided its alphabet). Instead, DFA-based lexers continue until there are no possible transitions from the current state, and then they return to the last receiving state. (See below.)
This DFA state may be accepted due to two different rules, but this is not a problem either; only the first receiving rule is recorded.
To be fair, this is slightly different from the DFA mathematical model, which has a transition for each symbol in each state (although many of them may be transitions to the "sink" state) and which corresponds to the full one depending on whether the machine is in the receiving state when the last input character is read. The lexer model is slightly different, but it can be easily formalized.
The only difficulty of the theoretical model is "returning to the last receiving state." In practice, this is usually done by storing the state and input position each time a receiving state is reached. This means that you may need to rewind the input stream, possibly an arbitrary amount.
Most languages โโdo not require backups very often, and very few require unlimited backups. Some lexer generators may generate faster code if there are no fallback states. ( flex will do this if you use -Cf or -Cf .)
One common case, which leads to undefined backups, does not provide an appropriate error return for string literals:
["][^"\n]*["] { return STRING; } /* ... */ . { return INVALID; }
Here, the first pattern will correspond to a string literal starting with " if there is a corresponding in the same string " . (I skipped \ elements for simplicity.) If the string literal is not destroyed, the last template will match, but the input should be rewound to " . In most cases, it makes no sense to continue lexical analysis, ignoring the unsurpassed " ; it would be wiser to just ignore the rest of the line. Thus, support is not only ineffective, but can also lead to an explosion of false error messages. A better solution might be:
["][^"\n]*["] { return STRING; } ["][^"\n]* { return INVALID_STRING; }
Here, the second alternative can be successful only if the line is not destroyed, because if the line is completed, the first alternative will correspond to another character. Therefore, it doesnโt even matter in which order these alternatives appear, although everyone I know would put them in the same order as me.