The LL (1) parser cannot select the correct rule for the two ELEMENT rules simply by looking at the next token. According to the grammar, the parser must try the first rule: ELEMENT ::= PREFIX > ELEMENT_OR_DATA END_TAG
And if it does not work, it must return from backtracking to try the second rule: ELEMENT ::= PREFIX />
The problem is that both rules start with the same PREFIX “object”. In this case, it is even “worse” because it is not terminal.
Definitely, this is not LL grammar (1). Let's try to build one.
First, we simplify the original grammar by removing TAG: DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTE* > (ELEMENT | DATA)* </ NAME > ELEMENT ::= < NAME ATTRIBUTE* /> ATTRIBUTE ::= NAME = STRING
The next step is to split the rules for ELEMENT to get the first token, which will help the parser to choose the right rule. DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTE* ELEMENT1 ELEMENT1 ::= > (ELEMENT | DATA)* </ NAME > ELEMENT1 ::= /> ATTRIBUTE ::= NAME = STRING
Now the parser can successfully start parsing the element. He “defers” the decision about whether it is an expanded or short element, and delegates this question to the rules of ELEMENT1. A later one can determine what type of element is being analyzed, checking if the next token is >
or />
.
Continue the conversion: DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTES ELEMENT1 ELEMENT1 ::= > ELEMENT_OR_DATA </ NAME > ELEMENT1 ::= /> ELEM_OR_DATA ::= ELEMENT ELEM_OR_DATA ELEM_OR_DATA ::= DATA ELEM_OR_DATA ELEM_OR_DATA ::= epsilon ATTRIBUTES ::= NAME = STRING ATTRIBUTES ATTRIBUTES ::= epsilon
We simply replaced the * operator with the corresponding LL syntax. This last grammar still has some problem: the first two ELEM_OR_DATA rules can “confuse” the parser, because it cannot guess which one to apply (a similar problem for the one we discussed at the beginning).
Solve this problem by specifying a parser: DOCUMENT ::= ELEMENT EOF ELEMENT ::= < ELEMENT0 ELEMENT0 ::= NAME ATTRIBUTES ELEMENT1 ELEMENT1 ::= > ELEMENT_OR_DATA </ NAME > ELEMENT1 ::= /> ELEM_OR_DATA ::= < ELEMENT0 ELEM_OR_DATA ELEM_OR_DATA ::= DATA ELEM_OR_DATA ELEM_OR_DATA ::= epsilon ATTRIBUTES ::= NAME = STRING ATTRIBUTES ATTRIBUTES ::= epsilon
We split ELEMENT1 and used ELEMENT0 in the first rule ELEM_OR_DATA. Now, assuming DATA is a token, the parser can easily determine which rule to apply by considering only the next token.