A context-free grammar is a description of a set of strings that is strictly more powerful than regular expressions, but is still easily processed by the machine. A more formal context-free grammar consists of four things:
A set of terminal characters that are elements of the generated strings. For a bison analyzer, this is usually a set of tokens created by the scanner. For grammar for a natural language such as English, this can be a collection of all English words.
A set of nonterminal characters . Intuitively, a nonterminal symbol is something like a part of speech, such as a “noun” or “verb”.
A set of productions. . Each work tells how to replace a nonterminal symbol with another set of terminals and nonterminals. For example, the product Variable -> Type Name says that if we see the Variable nonterminal, we can replace it with the Type Name string.
The character start < , which is the nonterminal from which the output begins.
As an example, consider this simple context-free grammar for declaring C-style functions:
Function -> Type ident(Arguments) Type -> int Type -> Type * Arguments -> e (the empty string) Arguments -> ArgList ArgList -> Type ident ArgList -> Type ident, ArgList
Here is the start character of Function . Given this grammar, we can create declarations of functions of style C by repeatedly choosing a nonterminal symbol and replacing it with one of the right sides of the corresponding product. At each step, the line that we have created so far is called a conditional form . For example, here are a few different function declarations that can be obtained from the grammar above:
Sentential Form Production ------------------------------------------------------------------- Function Function -> Type ident(Arguments) Type ident(Arguments) Type -> int int ident(Arguments) Arguments -> e int ident() Sentential Form Production ------------------------------------------------------------------- Function Function -> Type ident(Arguments) Type ident(Arguments) Type -> Type* Type* ident(Arguments) Type -> int int* ident(Arguments) Arguments -> ArgList int* ident(ArgList) ArgList -> Type ident, ArgList int* ident(Type ident, ArgList) ArgList -> Type ident int* ident(Type ident, Type ident) Type -> Type* int* ident(Type* ident, Type ident) Type -> Type* int* ident(Type** ident, Type ident) Type -> int int* ident(int** ident, Type ident) Type -> int int* ident(int** ident, int ident)
Most programming languages have a structure that can be described without contextual grammar. The analyzer's task is to take the program and grammar and determine how this program can be generated by the grammar.
As for LALR (1), unfortunately, the formal definition of LALR (1) is not trivial. I just finished learning how to build a compiler, and we were only able to talk about LALR (1), having previously given two lectures, discussing related parsing methods. If you want to get an official introduction to the material, my slides for parsing from the bottom up are available by shift , or shorten . shift means we are looking at another input token to get more information about which abbreviations to apply. to decrease means that we take a certain amount of tokens and non-terminals and reverse production in order to return to some non-terminal.
Here's the parsing of the line below:
Workspace Input Action ----------------------------------------------------------------- int** ident(int* ident) Shift int ** ident(int* ident) Reduce Type -> int Type ** ident(int* ident) Shift Type* * ident(int* ident) Reduce Type -> Type* Type * ident(int* ident) Shift Type* ident(int* ident) Reduce Type -> Type* Type ident(int* ident) Shift Type ident (int* ident) Shift Type ident( int* ident) Shift Type ident(int * ident) Reduce Type -> int Type ident(Type * ident) Shift Type ident(Type* ident) Reduce Type -> Type* Type ident(Type ident) Shift Type ident(Type ident ) Reduce ArgList -> Type ident Type ident(ArgList ) Reduce Arguments -> ArgList Type ident(Arguments ) Shift Type ident(Arguments) Reduce Function -> Type ident(Arguments) Function ACCEPT
It is important to know about shifts and reductions, because when using a bison you will constantly encounter shifts / reductions of conflicts and reduce / reduce conflicts . These errors mean that the parser generator has determined that the parser may be in a state in which it cannot determine whether to shift or reduce or which of the two abbreviations it should perform. Refer to the bison manual for more details on how to handle this.
If you want to learn more about context-free grammars and parsing algorithms in general, there is an excellent book called Collapsible Methods: A Practical Guide, second edition by Grune and Jacobs, which is certainly best for the material I ever either seen. It covers all kinds of parsing algorithms, including many methods that are significantly more powerful than LALR (1), which are starting to be more widely used (e.g. GLR and Earley). I highly recommend this book - this is the main reason I am so good at parsing!
Hope this helps!