Do I have LL (1) grammar for this XML subset?

I'm going to make a parser for a fictional subset of XML with the following EBNF grammar:

DOCUMENT ::= ELEMENT ELEMENT ::= START_TAG (ELEMENT | DATA)* END_TAG | EMPTY_TAG START_TAG ::= < NAME ATTRIBUTE* > END_TAG ::= </ NAME > EMPTY_TAG ::= < NAME ATTRIBUTE* /> ATTRIBUTE ::= NAME = STRING 

Above was an “as is” grammar without any changes. Here is my attempt to convert it to LL (1):

 DOCUMENT ::= ELEMENT EOF ELEMENT ::= PREFIX > ELEMENT_OR_DATA END_TAG | PREFIX /> PREFIX ::= < NAME OPT_ATTR ELEMENT_OR_DATA ::= OPT_ELEMENT ELEMENT_OR_DATA | OPT_DATA ELEMENT_OR_DATA | epsilon OPT_ELEMENT ::= ELM_LIST | epsilon ELM_LIST ::= ELEMENT | ELEMENT ELM_LIST OPT_DATA ::= DATA_LIST | epsilon DATA_LIST ::= DATA | DATA DATA_LIST END_TAG ::= </ NAME > OPT_ATTR ::= ATTR_LIST | epsilon ATTR_LIST ::= ATTRIBUTE | ATTRIBUTE ATTR_LIST ATTRIBUTE ::= NAME = STRING EOF ::= &$ 

Is this a version of the original LL (1)? If not, where did I go wrong? And if so, is there a way to “simplify” without changing the meaning? I'm not sure I have the simplest version.

I hope this is clear.

+2
source share
1 answer

The LL (1) parser cannot select the correct rule for the two ELEMENT rules simply by looking at the next token. According to the grammar, the parser must try the first rule: ELEMENT ::= PREFIX > ELEMENT_OR_DATA END_TAG And if it does not work, it must return from backtracking to try the second rule: ELEMENT ::= PREFIX />

The problem is that both rules start with the same PREFIX “object”. In this case, it is even “worse” because it is not terminal.

Definitely, this is not LL grammar (1). Let's try to build one.

First, we simplify the original grammar by removing TAG: DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTE* > (ELEMENT | DATA)* </ NAME > ELEMENT ::= < NAME ATTRIBUTE* /> ATTRIBUTE ::= NAME = STRING

The next step is to split the rules for ELEMENT to get the first token, which will help the parser to choose the right rule. DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTE* ELEMENT1 ELEMENT1 ::= > (ELEMENT | DATA)* </ NAME > ELEMENT1 ::= /> ATTRIBUTE ::= NAME = STRING

Now the parser can successfully start parsing the element. He “defers” the decision about whether it is an expanded or short element, and delegates this question to the rules of ELEMENT1. A later one can determine what type of element is being analyzed, checking if the next token is > or /> .

Continue the conversion: DOCUMENT ::= ELEMENT ELEMENT ::= < NAME ATTRIBUTES ELEMENT1 ELEMENT1 ::= > ELEMENT_OR_DATA </ NAME > ELEMENT1 ::= /> ELEM_OR_DATA ::= ELEMENT ELEM_OR_DATA ELEM_OR_DATA ::= DATA ELEM_OR_DATA ELEM_OR_DATA ::= epsilon ATTRIBUTES ::= NAME = STRING ATTRIBUTES ATTRIBUTES ::= epsilon

We simply replaced the * operator with the corresponding LL syntax. This last grammar still has some problem: the first two ELEM_OR_DATA rules can “confuse” the parser, because it cannot guess which one to apply (a similar problem for the one we discussed at the beginning).

Solve this problem by specifying a parser: DOCUMENT ::= ELEMENT EOF ELEMENT ::= < ELEMENT0 ELEMENT0 ::= NAME ATTRIBUTES ELEMENT1 ELEMENT1 ::= > ELEMENT_OR_DATA </ NAME > ELEMENT1 ::= /> ELEM_OR_DATA ::= < ELEMENT0 ELEM_OR_DATA ELEM_OR_DATA ::= DATA ELEM_OR_DATA ELEM_OR_DATA ::= epsilon ATTRIBUTES ::= NAME = STRING ATTRIBUTES ATTRIBUTES ::= epsilon

We split ELEMENT1 and used ELEMENT0 in the first rule ELEM_OR_DATA. Now, assuming DATA is a token, the parser can easily determine which rule to apply by considering only the next token.

+2
source

Source: https://habr.com/ru/post/1011669/


All Articles