Can JavaCC distinguish a token in its context?

The main requirement is to be able to use a keyword as an identifier, so I want to distinguish a token by its context (for example, class is a keyword, but we also allow a variable called class).

In Java this is possible, but it turns out to be quite complicated. Here is how I do it:

    TOKEN : {
        <I_CAL: "CAL"> : DO_CAL
      | <I_CALL: "CALL">
      | <I_CMP: "CMP">
      | <I_EXIT: "EXIT">
      | <I_IN: "IN">
      | <I_JMP: "JMP">
      | <I_JPC: "JPC"> : NEED_CMP_OP
      | <I_LD: "LD"> : NEED_DATA_TYPE
      | <I_NOP: "NOP">
      | <I_OUT: "OUT">
      | <I_POP: "POP">
      | <I_PUSH: "PUSH">
      | <I_RET: "RET">
      | <I_DATA: "DATA"> : DO_DATA
      | <I_BLOCK: ".BLOCK">
    }

    // T prefix for Token
    TOKEN : {
        <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" | "RS" | "RB">
      // We need the TOKENs below only in a special context, otherwise they are just IDENTIFIERs
      // | <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT">
      // | <PSEUDO_DATA_TYPE: "CHAR" >
      // | <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD">
      // | <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
      | <T_LABEL: <IDENTIFIER> ([" "])* <COLON>>
    }

    // Now we need a CMP_OP
    <NEED_CMP_OP> TOKEN: {
        <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ"> : DEFAULT
    }

    // Now we need a DATA_TYPE
    <NEED_DATA_TYPE,DO_CAL> TOKEN: {
        // EXTENSION: add CHAR to the data types
        <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT" | "CHAR">
        {
            if (curLexState == DO_CAL) {
                SwitchTo(NEED_CAL_OP);
            } else {
                SwitchTo(DEFAULT);
            }
        }
    }

    // We need a CAL_OP
    <NEED_CAL_OP> TOKEN: {
        <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD"> : DEFAULT
    }

    // Also need to skip whitespace in these states
    <NEED_DATA_TYPE,NEED_CAL_OP,NEED_CMP_OP,DO_CAL,DO_DATA> SKIP: {
        " " | "\t" | "\r" | "\f"
    }

Source here. With this I can distinguish the token by the curLexState context.

It works, but it is fussy: you have to add and then maintain a lot of extra lexical states. Is there an easier way to achieve this?

2 answers

There are three ways to do this, described in the JavaCC Frequently Asked Questions.

  • One of them is to use lexical states, as you did. This approach can be complicated, but it is the only way to deal with situations where the length of the longest match depends on the context, or where the skipping rules depend on the context. For your problem it is probably more machinery than you need.
  • The second is to use a single token kind and a semantic lookahead based on the token image to make the parser treat certain tokens specially in certain contexts. See the FAQ for details; a minimal sketch appears after this list.
  • The third (and usually the simplest) approach is to make the distinctions at the lexical level and then ignore them at the syntactic level wherever they do not matter. This is usually the best way to deal with keywords that can double as identifiers.
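
For reference, here is a minimal sketch of the second approach (my own illustration, not taken from the FAQ): the word "class" is never given its own token kind but is lexed as an ordinary <IDENTIFIER>, and a semantic lookahead on the token image picks the right branch. The ClassDeclaration() and Assignment() productions are assumed for illustration.

    void Statement() : {}
    {
        // Take the declaration branch only when the next token is an
        // identifier whose image is exactly "class".
        LOOKAHEAD( { getToken(1).kind == IDENTIFIER
                     && getToken(1).image.equals("class") } )
        ClassDeclaration()
      |
        Assignment()
    }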

Below I will give three examples of the third approach.


Using keywords as identifiers

If all you want to do is allow the keyword class to be used as a variable name, there is a very simple way to do it. In the lexer, write the usual rules:

    TOKEN: { <CLASS: "class"> }
    TOKEN: { <VARNAME: ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z"])* > }  // Or whatever you want

In the parser write

    Token varName() : { Token t ; }
    {
        ( t = <CLASS> | t = <VARNAME> )
        { return t ; }
    }

Then use varName() elsewhere in the parser.
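
For example, a production that uses it might look like the following. This is a sketch of mine (not part of the original answer), and Expression() is an assumed production:

    void Assignment() : { Token v ; }
    {
        // varName() accepts <CLASS> as well as <VARNAME>, so an assignment
        // to a variable called "class" parses just like any other assignment.
        v = varName() "=" Expression() ";"
        { System.out.println("assigned to " + v.image) ; }
    }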


The original poster's assembler

Turning to the assembler example from the original question, let's take the JPC instruction. A JPC (jump conditional) instruction is followed by a comparison operator such as Z, B, etc., and then an operand, which can be many things, including an identifier. For instance, we could have

 JPC Z fred 

But we could also have identifiers named JPC or Z, so that

 JPC Z JPC 

and

 JPC ZZ 

are also valid JPC instructions.

In the lexical part we have

    TOKEN : // Opcodes
    {
        <I_CAL: "CAL">
      | <I_JPC: "JPC">
      | ... // other opcodes
      | <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
      | <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" | "RS" | "RB">
    }

    ... // Other lexical rules.

    TOKEN : // Be sure this rule comes after all keywords.
    {
        < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
    }

In the parser we have

    Instruction Instruction() : {
        Instruction inst = new Instruction();
        Token o = null, dataType = null, calType = null, cmpType = null;
        Operand a = null, b = null;
    }
    {
        ...
        o = <I_JPC> cmpType = <CMP_OP> a = Operand()
        ...
    }

    Operand Operand() : {
        Token t ;
        ...
    }
    {
        t = <T_REGISTER> ...
      | t = Identifier() ...
        ...
    }

    Token Identifier() : {
        Token t ;
    }
    {
        t = <IDENTIFIER> { return t ; }
      | t = <I_CAL>      { return t ; }
      | t = <I_JPC>      { return t ; }
      | t = <CMP_OP>     { return t ; }
      | ... // All other keywords
    }

I would suggest leaving the register names out of the list of other keywords that can be used as identifiers.

If you did include <T_REGISTER> in this list, there would be an ambiguity in Operand, since Operand looks like

    Operand Operand() : {
        Token t ;
        ...
    }
    {
        t = <T_REGISTER> ...
      | t = Identifier() ...
        ...
    }

Now there is ambiguity because

 JPC Z R0 

has two parses. In an operand context we want tokens like "R0" to be parsed as registers, not as identifiers. Fortunately, JavaCC prefers earlier choices, so that is exactly what will happen. You will get a warning from JavaCC; you can ignore the warning. (I would add a comment to the source code so that other programmers do not worry about it.) Or you can suppress the warning with a lookahead specification:

    Operand Operand() : {
        Token t ;
        ...
    }
    {
        LOOKAHEAD(1)
        t = <T_REGISTER> ...
      | t = Identifier() ...
        ...
    }

Using context to the right

So far, all the examples have used left context. That is, we can tell how to treat a token based solely on the sequence of tokens to its left. Now let's look at a case where the interpretation of a keyword depends on the tokens to its right.

Consider this simple imperative language in which all keywords can be used as variable names.

    P -> Block <EOF>
    Block -> [ S Block ]
    S -> Assignment | IfElse
    Assignment -> LHS ":=" Exp
    LHS -> VarName
    IfElse -> "if" Exp Block [ "else" Block ] "end"
    Exp -> VarName
    VarName -> <ID> | "if" | "else" | "end"

This grammar is unambiguous. You can make the grammar more complex by adding new kinds of statements, expressions and left-hand sides; as long as the grammar remains unambiguous, such complications probably won't matter much for what I am going to say next. Feel free to experiment.
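
In JavaCC, the VarName production can be written the same way as varName() earlier, with every keyword token accepted as a name. A sketch of mine, assuming the token names <ID>, <IF>, <ELSE> and <END>:

    Token VarName() : { Token t ; }
    {
        // Keyword tokens are accepted here and simply treated as names.
        ( t = <ID> | t = <IF> | t = <ELSE> | t = <END> )
        { return t ; }
    }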

The grammar is not LL(1). There are two places where a choice must be made on the basis of more than one token of lookahead. One of them is the choice between Assignment and IfElse when the next token is "if". Consider the block

    a := b
    if := a

vs

    a := b
    if q
        b := c
    end

We can resolve this by looking ahead for the ":=", like this:

    void S() : {}
    {
        LOOKAHEAD( LHS() ":=" )
        Assignment()
      | IfElse()
    }

The other place we need lookahead is at the start of a Block when the next token is "else" or "end": the block might be empty (the "else" or "end" then belongs to an enclosing IfElse), or the "else" or "end" might be a variable name starting another statement. Consider

 if x end := y else := z end 

We can solve it with

    void Block() : {}
    {
        LOOKAHEAD( LHS() ":=" | "if" )
        S() Block()
      | {}
    }

If you combine the lexer and the parser into a single character-level (scannerless) parser, then it is relatively easy to distinguish keywords in context, because the parser's whole job is keeping track of context. You could use JavaCC with single-character tokens to get this effect, but its LL nature would probably make it impractical to write usable grammars for other reasons.

If you separate the lexer and the parser, this is not easy.

You are asking the lexer to know when something is an identifier and when it is a keyword, which it can only do by knowing the context in which the identifier/keyword occurs.

Ideally, the lexer would simply ask the parser for its state, and that state would determine the context in which the choice is made. This is hard to arrange: most parsers are not designed to expose their state, let alone expose it in a form that is easy to interpret to extract the needed context signal. JavaCC is not organized this way.

Your other obvious choice is to model the different contexts as states in the lexer, with transitions between lexing states corresponding to transitions between the interesting contexts. If you can do this, you have to code the states and transitions into your lexer and keep them up to date. Where that can be done "easily", this is not a bad solution; depending on the specific contexts it may also be hard or impossible.

For the OP's purpose (a parser for an assembler), the context is largely determined by position within the source line. You can roughly divide assembler input into Label, Opcode, Operand and Comment fields by watching the whitespace: a newline puts the lexer into the Label context, whitespace in the Label context switches it to the Opcode context, whitespace in the Opcode context switches it to the Operand context, and whitespace in the Operand context switches it to the Comment context. Given these state transitions you can write a different sub-lexer for each context, with different keywords in each sub-context; see the sketch below.
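
A rough sketch of that scheme in JavaCC (my own illustration, not the OP's grammar). It assumes the classic assembler column layout in which every line starts either with a label or with whitespace, and that the operand field contains no embedded blanks:

    // DEFAULT serves as the Label context at the start of each line.
    TOKEN : { <LABEL: (["A"-"Z","a"-"z","0"-"9","_"])+ ":"> }
    SKIP  : { < ([" ","\t"])+ > : OPCODE_CTX }                 // blanks -> opcode field

    <OPCODE_CTX> TOKEN : {      // "JPC", "LD", ... are keywords only in the opcode field
        <I_JPC: "JPC">
      | <I_LD:  "LD">
      | <OPCODE: (["A"-"Z"])+>
    }
    <OPCODE_CTX> SKIP : { < ([" ","\t"])+ > : OPERAND_CTX }    // blanks -> operand field

    // In the operand field "JPC" is just an <OPERAND>, i.e. an ordinary name.
    <OPERAND_CTX> TOKEN : { <OPERAND: (["A"-"Z","a"-"z","0"-"9","_",","])+> }
    <OPERAND_CTX> SKIP  : { < ([" ","\t"])+ > : COMMENT_CTX }  // blanks -> trailing comment

    <COMMENT_CTX> SKIP : { < ~["\n"] > }                       // swallow the comment text

    <*> SKIP : { < "\r\n" | "\n" | "\r" > : DEFAULT }          // a new line resets to the Label context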

This trick does not work for languages like PL/I, which have a huge number of context-dependent keywords (in PL/I, in fact, every keyword is context-dependent!).

A less obvious choice is not to try to distinguish them at all: when an identifier/keyword is found, feed both tokens to the parser and let it sort out which one leads to a viable parse. (Note that it may have to handle the cross-product of several ambiguous tokens, so there can be a lot of candidate parses to sort through.) This requires a parser that can handle ambiguity, both in the tokens it accepts (otherwise it cannot accept the identifier token and the keyword token at the same point) and in the parses it builds. It is a beautifully simple solution when you have the right parsing engine; JavaCC is not that engine.

[See my bio for a GLR parsing engine in which all three solutions are readily available. It handles PL/I easily.]

