There are three ways to do this, all described in the JavaCC FAQ.
- One of them is to use lexical states as you did. This method may be complicated, but it is the only way to deal with situations where the length of the longest match depends on the context or where skipping rules depend on the context. For your problem, this is probably harder than you need.
- The second is to use a single kind of token and use semantic lookahead based on the token's image to make the parser treat some tokens specially in some circumstances. See the FAQ for more details.
- The third (and usually the simplest) approach is to make distinctions at the lexical level and then ignore those distinctions at the syntactic level. This is usually the best way to deal with keywords that may double as identifiers.
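For comparison, the second approach might be sketched as follows. This is only an illustration under assumptions: the token name ID and the production name statement are hypothetical and do not come from the question. Every word is lexed as the same kind of token, and a semantic lookahead (a Boolean Java expression in braces) inspects the image of the next token to decide whether it should act as the keyword.

```
// Every word, including "class", is lexed as a plain ID.
TOKEN : { < ID: ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z"])* > }

// Hypothetical production: "class Foo" is a declaration,
// anything else starting with a word is an assignment.
void statement() : {}
{
    // getToken(1) and getToken(2) peek at the next two unconsumed tokens.
    // The first alternative is chosen only when the test is true.
    LOOKAHEAD( { getToken(1).image.equals("class") && getToken(2).kind == ID } )
    <ID> <ID>
|   <ID> ":=" <ID>
}
```

With a rule like this, the word class acts as a keyword only when another word follows it, so an input such as class := x would still be treated as an assignment to a variable named class.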
Below I will give three examples of the third approach.
Using keywords as identifiers
If all you want to do is allow the keyword class to be used as a variable name, there is a very simple way to do this. In the lexer, put in the usual rules.
TOKEN: { <CLASS: "class"> }
TOKEN: { < VARNAME: ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z"])* > }
In the parser write
Token varName() : { Token t ; }
{
    (t = <CLASS> | t = <VARNAME>)
    {return t ;}
}
Then use varName() elsewhere in the parser.
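As a small illustration of such a use, here is a sketch of a declaration rule. The INT token and the declaration production are hypothetical names introduced only for this example; varName() is the production defined above.

```
TOKEN : { <INT: "int"> }   // hypothetical type keyword; must be listed before VARNAME

// Hypothetical rule: accepts "int x ;" and also "int class ;"
void declaration() : { Token name ; }
{
    <INT> name = varName() ";"
    { System.out.println("declared variable " + name.image); }
}
```

Because the rule calls varName() rather than matching <VARNAME> directly, it does not need to know which keywords are allowed as names; that list lives in one place.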
The original poster's assembler
Turning to the assembler in the original question, take the JPC instruction as an example. The JPC (jump conditional) instruction is followed by a comparison operator such as Z, B, etc., and then an operand, which can be any of a number of things, including an identifier. For example, we could have
JPC Z fred
But we could also have an identifier named JPC or Z, so
JPC Z JPC
and
JPC Z Z
are also valid JPC instructions.
In the lexical part we have
TOKEN : // Opcodes
{
    <I_CAL: "CAL">
|   <I_JPC: "JPC">
|   ... // other op codes
|   <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
|   <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" | "RS" | "RB">
}
... // Other lexical rules.

TOKEN : // Be sure this rule comes after all keywords.
{
    < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
}
In the parser we have
Instruction Instruction() : {
    Instruction inst = new Instruction();
    Token o = null, dataType = null, calType = null, cmpType = null;
    Operand a = null, b = null;
}
{
    ...
    o = <I_JPC> cmpType = <CMP_OP> a = Operand()
    ...
}

Operand Operand() : { Token t ; ... }
{
    t = <T_REGISTER> ...
|   t = Identifier() ...
    ...
}

// Any keyword listed here can also be used as an identifier.
Token Identifier() : { Token t ; }
{
    t = <IDENTIFIER> {return t ;}
|   t = <I_CAL> {return t ;}
|   t = <I_JPC> {return t ;}
|   t = <CMP_OP> {return t ;}
|   ...
}
I would suggest excluding register names from the list of other keywords that can be used as identifiers.
If you do include <T_REGISTER> in that list, there will be an ambiguity in Operand, because Operand looks like
Operand Operand() : { Token t ; ... }
{
    t = <T_REGISTER> ...
|   t = Identifier() ...
    ...
}
Now there is ambiguity because
JPC Z R0
has two parses. In the context of being an operand, we want tokens like "R0" to be parsed as registers, not identifiers. Fortunately, JavaCC prefers earlier choices, so this is exactly what will happen. You will get a warning from JavaCC. You can ignore the warning. (I add a comment to my source code so that other programmers do not worry.) Or you can suppress the warning with a lookahead specification.
Operand Operand() : { Token t ; ... }
{
    LOOKAHEAD(1) t = <T_REGISTER> ...
|   t = Identifier() ...
    ...
}
Using the right context
So far, all the examples have used left context. That is, we can tell how to treat a token based solely on the sequence of tokens to its left. Let's look at a case where the interpretation of a keyword depends on the tokens to its right.
Consider this simple imperative language in which all keywords can be used as variable names.
P -> Block <EOF>
Block -> [S Block]
S -> Assignment | IfElse
Assignment -> LHS ":=" Exp
LHS -> VarName
IfElse -> "if" Exp Block ["else" Block] "end"
Exp -> VarName
VarName -> <ID> | "if" | "else" | "end"
This grammar is unambiguous. You can make the grammar more complex by adding new kinds of statements, expressions, and left-hand sides; as long as the grammar remains unambiguous, such complications will probably not matter much for what I am going to say next. Feel free to experiment.
The grammar is not LL(1). There are two places where a choice must be made based on more than one token of lookahead. One is the choice between Assignment and IfElse when the next token is "if". Consider the block
a := b
if := a

vs

a := b
if q b := c end
We can look ahead for a ":=", like this
void S() : {}
{
    LOOKAHEAD( LHS() ":=" ) Assignment()
|   IfElse()
}
The other place where we need to look ahead is when an "else" or an "end" is encountered at the beginning of a Block. Consider
if x
    end := y
    else := z
end
We can solve it with
void Block() : {}
{
    LOOKAHEAD( LHS() ":=" | "if" ) S() Block()
|   {}
}
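To try this out, the pieces can be assembled into one grammar file. The following is a minimal sketch, not a tested implementation: the parser name Demo, the token names, the whitespace rule, and the hard-coded test input are all assumptions added for illustration. The productions themselves follow the grammar and the two lookahead specifications given above.

```
options { STATIC = false; }

PARSER_BEGIN(Demo)
import java.io.StringReader;

public class Demo {
    public static void main(String[] args) throws ParseException {
        // Example input from above: "end" and "else" used as variable names.
        String src = "if x end := y else := z end";
        new Demo(new StringReader(src)).P();   // throws ParseException on a syntax error
        System.out.println("parsed");
    }
}
PARSER_END(Demo)

SKIP : { " " | "\t" | "\n" | "\r" }

// Keywords first, so that they win over ID when the match lengths are equal.
TOKEN : { < ASSIGN: ":=" > | < IF: "if" > | < ELSE: "else" > | < END: "end" > }
TOKEN : { < ID: ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z"])* > }

void P() : {} { Block() <EOF> }

// Continue the block only if an assignment or an "if" starts here.
void Block() : {}
{
    LOOKAHEAD( LHS() ":=" | "if" ) S() Block()
|   {}
}

// "if" followed by ":=" is an assignment to a variable named if.
void S() : {}
{
    LOOKAHEAD( LHS() ":=" ) Assignment()
|   IfElse()
}

void Assignment() : {} { LHS() ":=" Exp() }
void LHS()        : {} { VarName() }
void IfElse()     : {} { "if" Exp() Block() [ "else" Block() ] "end" }
void Exp()        : {} { VarName() }

// The lexical distinction between keywords and identifiers is ignored here.
void VarName()    : {} { <ID> | "if" | "else" | "end" }
```

Assuming the file is saved as Demo.jj, it would be processed with javacc Demo.jj, after which the generated Java files are compiled and Demo is run in the usual way. Changing the string in main is an easy way to experiment with other inputs, such as the two blocks beginning with a := b shown earlier.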