Studying a bison: what are context-free grammars and LALR (1)?

Question

Studying a bison: what are context-free grammars and LALR (1)?

I read this bison introduction.

I have two questions, and it will be great if someone helps me understand:

What does the term context free grammar mean?
From the above link: not all languages without a context can be processed by Bison, only those that are LALR (1). In short, this means that it should be possible to tell how to parse any part of the input string with just one wait character. What does this mean: "you can tell how to parse any part of the input string with just one forward search character"

+6

algorithm parsing bison grammar lalr

hari Aug 24 '11 at 16:51

source share

2 answers

1) In simple words. This means that the formation and context of the code is not important for individual parts, and you do not see how it is formatted to understand the meaning. For example, you can write your C program in one line or with each word in another line, and the program will mean the same thing. Wikipedia have a more formal definition.

2) That LALR (1) means more complex, but in simple words I would describe it as understanding the meaning step by step reading one word at a time - only viewing the next word / symbol - some spoken languages are famous for being not LALR (1) - where you don’t understand that the sentence is a question or expression when you get to the end of the sentence. Some programming languages can be created in the same way, but I don’t know anyone, because for practical reasons, everyone follows the LALR (1) syntax. However, I think there are exceptions when C / C ++ needs 2 characters in order to parse these statements correctly (therefore, this is LALR (2), but since I cannot remember from the very beginning that it such, I hope someone points out this in a comment.

In any case, this book here is a bible when it comes to understanding compilers and parsers.

+1

Soren Aug 24 '11 at 17:10

source share

templatetypedef · Accepted Answer · 2011-08-24T17:12:53+0000

A context-free grammar is a description of a set of strings that is strictly more powerful than regular expressions, but is still easily processed by the machine. A more formal context-free grammar consists of four things:

A set of terminal characters that are elements of the generated strings. For a bison analyzer, this is usually a set of tokens created by the scanner. For grammar for a natural language such as English, this can be a collection of all English words.
A set of nonterminal characters . Intuitively, a nonterminal symbol is something like a part of speech, such as a “noun” or “verb”.
A set of productions. . Each work tells how to replace a nonterminal symbol with another set of terminals and nonterminals. For example, the product Variable -> Type Name says that if we see the Variable nonterminal, we can replace it with the Type Name string.
The character start < , which is the nonterminal from which the output begins.

As an example, consider this simple context-free grammar for declaring C-style functions:

 Function -> Type ident(Arguments) Type -> int Type -> Type * Arguments -> e (the empty string) Arguments -> ArgList ArgList -> Type ident ArgList -> Type ident, ArgList

Here is the start character of Function . Given this grammar, we can create declarations of functions of style C by repeatedly choosing a nonterminal symbol and replacing it with one of the right sides of the corresponding product. At each step, the line that we have created so far is called a conditional form . For example, here are a few different function declarations that can be obtained from the grammar above:

 Sentential Form Production ------------------------------------------------------------------- Function Function -> Type ident(Arguments) Type ident(Arguments) Type -> int int ident(Arguments) Arguments -> e int ident() Sentential Form Production ------------------------------------------------------------------- Function Function -> Type ident(Arguments) Type ident(Arguments) Type -> Type* Type* ident(Arguments) Type -> int int* ident(Arguments) Arguments -> ArgList int* ident(ArgList) ArgList -> Type ident, ArgList int* ident(Type ident, ArgList) ArgList -> Type ident int* ident(Type ident, Type ident) Type -> Type* int* ident(Type* ident, Type ident) Type -> Type* int* ident(Type** ident, Type ident) Type -> int int* ident(int** ident, Type ident) Type -> int int* ident(int** ident, int ident)

Most programming languages have a structure that can be described without contextual grammar. The analyzer's task is to take the program and grammar and determine how this program can be generated by the grammar.

As for LALR (1), unfortunately, the formal definition of LALR (1) is not trivial. I just finished learning how to build a compiler, and we were only able to talk about LALR (1), having previously given two lectures, discussing related parsing methods. If you want to get an official introduction to the material, my slides for parsing from the bottom up are available by shift , or shorten . shift means we are looking at another input token to get more information about which abbreviations to apply. to decrease means that we take a certain amount of tokens and non-terminals and reverse production in order to return to some non-terminal.

Here's the parsing of the line below:

 Workspace Input Action ----------------------------------------------------------------- int** ident(int* ident) Shift int ** ident(int* ident) Reduce Type -> int Type ** ident(int* ident) Shift Type* * ident(int* ident) Reduce Type -> Type* Type * ident(int* ident) Shift Type* ident(int* ident) Reduce Type -> Type* Type ident(int* ident) Shift Type ident (int* ident) Shift Type ident( int* ident) Shift Type ident(int * ident) Reduce Type -> int Type ident(Type * ident) Shift Type ident(Type* ident) Reduce Type -> Type* Type ident(Type ident) Shift Type ident(Type ident ) Reduce ArgList -> Type ident Type ident(ArgList ) Reduce Arguments -> ArgList Type ident(Arguments ) Shift Type ident(Arguments) Reduce Function -> Type ident(Arguments) Function ACCEPT

It is important to know about shifts and reductions, because when using a bison you will constantly encounter shifts / reductions of conflicts and reduce / reduce conflicts . These errors mean that the parser generator has determined that the parser may be in a state in which it cannot determine whether to shift or reduce or which of the two abbreviations it should perform. Refer to the bison manual for more details on how to handle this.

If you want to learn more about context-free grammars and parsing algorithms in general, there is an excellent book called Collapsible Methods: A Practical Guide, second edition by Grune and Jacobs, which is certainly best for the material I ever either seen. It covers all kinds of parsing algorithms, including many methods that are significantly more powerful than LALR (1), which are starting to be more widely used (e.g. GLR and Earley). I highly recommend this book - this is the main reason I am so good at parsing!

Hope this helps!

Studying a bison: what are context-free grammars and LALR (1)?

More articles: