How is Python grammar used inside?

Question

How is Python grammar used inside?

I'm trying to get a deeper understanding of how Python works, and I was looking at the grammar shown at http://docs.python.org/3.3/reference/grammar.html .

I notice that he says that you will have to change parsermodule.c as well, but to be honest, I just don’t understand what is going on here.

I understand that grammar is a specification of how to read a language, but ... I can’t even say that it is written. It looks almost like Python, but it is not.

I am looking to better understand this specification and how it is used inside Python to ... do something. What depends on it (the answer is everything, but I mean specifically, what aspect of the "engine" processes it), what uses it, how is it related to compiling / running the script?

It's hard to believe that the whole language comes down to a specification of two pages ...

+12

python python-3.x language-design grammar

Aerovistae Oct 13 '13 at

source share

4 answers

Python grammar - like most others - is represented in the BNF or Backsu-Naur Form . Try to read, how to read, but the basic structure:

 <something> ::= (<something defined elsewhere> | [some fixed things]) [...]

This counts as <something> defined as something else or any of the fixed things repeated many times.

BNF is based on an almost 2,000-year-old format for describing the permissible structure of a language, it is incredibly concise and describes all the allowed structures in a given language, not necessarily all that make sense.

Example

Basic arithmetic can be described as:

 <simple arithmetic expression> ::= <numeric expr>[ ]...(<operator>[ ]...<numeric expr>|<simple arithmetic expression>) <numeric expr> ::= [<sign>]<digit>[...][.<digit>[...]] <sign> ::= +|- <operator> ::= [+-*/] <digit> ::= [0123456789]

It says that a simple arithmetic operation is, optionally, a signed number consisting of one or more digits, possibly with a decimal point and one or more subsequent digits, optionally followed by spaces, followed by exactly one of +-*/ , optional followed by spaces followed by either a number or another simple arithmetic operation, i.e. a number followed by, etc.

All basic arithmetic operations are described here and can be extended to include functions, etc. Please note that this allows you to use invalid operations that are valid syntax, for example: 22.34 / -0.0 is syntactic, although the result is invalid.

Sometimes you may know that operations are possible that you might not have thought of, for example: 56+-50 is a valid operation, such as 2*-10 , but 2*/3 not.

Note that SGML and XML / Schema are related but different methodologies for describing the structure of any language. YAML is another way to describe the allowed structures on a computer in certain languages.

Disclaimer: my BNF is a little rusty, so if I made any serious mistakes in the above apologies and please correct me.

+7

Steve Barnes Oct 13 '13 at 22:44

source share

This is basically an EBNF (Extended Backus-Naur Form).

+5

Amber Oct 13

source share

When you write a program in a language, the very first thing your interpreter / compiler needs to do to move from a sequence of characters to a real action is to translate that sequence of characters into a structure with a higher degree of complexity. To do this, first it breaks your program into a sequence of tokens expressing what each "word" represents. For example, construction

 if foo == 3: print 'hello'

will be converted to

 1,0-1,2: NAME 'if' 1,3-1,6: NAME 'foo' 1,7-1,9: OP '==' 1,10-1,11: NUMBER '3' 1,11-1,12: OP ':' 1,13-1,18: NAME 'print' 1,19-1,26: STRING "'hello'" 2,0-2,0: ENDMARKER ''

But note that even something like "if if if if" is correctly placed in tokens

 1,0-1,2: NAME 'if' 1,3-1,5: NAME 'if' 1,6-1,8: NAME 'if' 1,9-1,11: NAME 'if' 2,0-2,0: ENDMARKER ''

What follows tokenization is parsing into a higher-level structure that analyzes whether tokens are really perceived together, something that the last example does not, but the first does. To do this, the analyzer must recognize the actual value of the tokens (for example, if is the keyword, and foo is the variable), then build a tree of tokens, organize them in a hierarchy and see if this hierarchy really makes sense. Here you see the grammar in which you see. This grammar is in BNF, which is a notation for expressing constructs that a language can recognize. This grammar is digested by a program (e.g. bison) that has the magical ability to take this grammar and generate the actual C code, which does the hard work for you, usually by recognizing markers, organizing them, returning the parse tree to you, or telling you where the error is.

Short version: language development is the definition of tokens and how these tokens combine to give something meaningful. This is done using the grammar you use to generate the actual parser code using automated tools.

+3

Stefano Borini Dec 19 '13 at 10:00

source share

user2097752 · Accepted Answer · 2013-12-30 08:03

Grammar is used to describe all possible lines in a language. It is also useful in determining how the parser should parse the language.

In this grammar, they seem to use their own version of EBNF , where non-terminal is any lowercase word, and the terminal is all uppercase or surrounded by quotation marks. For example, NEWLINE is a terminal, arith_expr is non-terminal, and "if" is also a terminal. Any non-terminal can be replaced with anything to the right of the colon of its corresponding production rule. For example, if you look at the first rule:

single_input: NEWLINE | simple_stmt | composite_stmt NEWLINE

We can replace single_input with one of NEWLINE, simple_stmt or complex_stmt, followed by NEWLINE. Suppose we replaced it with "compound_stmt NEWLINE", then we will look for a production rule for complex_stmt:

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated

and select which one we want to use and replace it with "composite_stmt" (keeping NEWLINE in it)

Suppose we wanted to generate a valid python program:

if 5 < 2 + 3 or not 1 == 5: raise

We could use the following output:

single_input
compound_stmt NEWLINE
if_stmt NEWLINE
'if' test ':' suite NEWLINE
'if' or_test ':' NEWLINE INDENT stmt stmt DEDENT NEWLINE
'if' and_test 'or' and_test ':' NEWLINE INDENT simple_stmt DEDENT NEWLINE
'if' not_test 'or' not_test ':' NEWLINE INDENT small_stmt DEDENT NEWLINE
'if' comparison 'or' 'not' not_test ':' NEWLINE INDENT flow_stmt DEDENT NEWLINE
'if' expr comp_op expr 'or' 'not' comparison ':' NEWLINE INDENT raise_stmt DEDENT NEWLINE
'if' arith_expr '<' arith_expr 'or' 'not' arith_expr comp_op arith_expr ':' NEWLINE INDENT 'raise' DEDENT NEWLINE
'if' term '<' term '+' term 'or' 'not' arith_expr == arith_expr ':' NEWLINE INDENT 'raise' DEDENT NEWLINE
'if' NUMBER '<' NUMBER '+' NUMBER 'or' 'not' NUMBER == NUMBER ':' NEWLINE INDENT 'raise' DEDENT NEWLINE

A few notes here, firstly, we must start with one of the non-terminals, which is indicated as the starting non-terminal. On this page they list them as single_input, file_input or eval_input. Secondly, the conclusion is completed after all the characters are terminal (hence the name). Thirdly, most often one substitution is done per line, for the sake of brevity, I did all possible replacements at once and began to skip steps towards the end.

Given the string in the language, how to find its output? This is the work of the parser. The parser draws the engineers a production sequence to first verify that it is indeed a valid string, and moreover, how it can be derived from grammar. It is worth noting that many grammars can describe one language. However, for a given line, its output, of course, will be different for each grammar. Therefore, technically we are writing a parser for grammar, not language. Some grammars are easier to parse, some grammars are easier to read / understand. It belongs to the first.

Nor does it indicate the whole language of what it looks like. Grammar says nothing about semantics.

If you're more interested in parsing and grammar, I recommend Grune, Jacobs - Parsing Techniques . It is free and good for self-study.

How is Python grammar used inside?

Example

More articles: