Using PLY to parse SQL statements

Question


I know there are other tools for parsing SQL statements, but I am rolling my own for educational purposes. I'm stuck on my grammar right now. If you can quickly spot the error, please let me know.

    SELECT = r'SELECT'
    FROM = r'FROM'
    COLUMN = TABLE = r'[a-zA-Z]+'
    COMMA = r','
    STAR = r'\*'
    END = r';'
    t_ignore = ' ' #ignores spaces

    statement : SELECT columns FROM TABLE END

    columns : STAR
            | rec_columns

    rec_columns : COLUMN
                | rec_columns COMMA COLUMN

When I try to parse a statement such as SELECT a FROM b; I get a syntax error at the FROM token. Any help is appreciated!

(Edit) Code:

    #!/usr/bin/python

    import ply.lex as lex
    import ply.yacc as yacc

    tokens = (
        'SELECT',
        'FROM',
        'WHERE',
        'TABLE',
        'COLUMN',
        'STAR',
        'COMMA',
        'END',
    )

    t_SELECT = r'select|SELECT'
    t_FROM = r'from|FROM'
    t_WHERE = r'where|WHERE'
    t_TABLE = r'[a-zA-Z]+'
    t_COLUMN = r'[a-zA-Z]+'
    t_STAR = r'\*'
    t_COMMA = r','
    t_END = r';'

    t_ignore = ' \t'

    def t_error(t):
        print 'Illegal character "%s"' % t.value[0]
        t.lexer.skip(1)

    lex.lex()

    NONE, SELECT, INSERT, DELETE, UPDATE = range(5)
    states = ['NONE', 'SELECT', 'INSERT', 'DELETE', 'UPDATE']
    current_state = NONE

    def p_statement_expr(t):
        'statement : expression'
        print states[current_state], t[1]

    def p_expr_select(t):
        'expression : SELECT columns FROM TABLE END'
        global current_state
        current_state = SELECT
        print t[3]

    def p_recursive_columns(t):
        '''recursive_columns : recursive_columns COMMA COLUMN'''
        t[0] = ', '.join([t[1], t[3]])

    def p_recursive_columns_base(t):
        '''recursive_columns : COLUMN'''
        t[0] = t[1]

    def p_columns(t):
        '''columns : STAR
                   | recursive_columns'''
        t[0] = t[1]

    def p_error(t):
        print 'Syntax error at "%s"' % t.value if t else 'NULL'
        global current_state
        current_state = NONE

    yacc.yacc()

    while True:
        try:
            input = raw_input('sql> ')
        except EOFError:
            break
        yacc.parse(input)

+6

python sql parsing context-free-grammar ply

sampwing Sep 08 '11 at 10:57


1 answer

Joe holloway · Accepted Answer · 2011-09-09T04:42:08+0000

I think your problem is that the regular expressions for t_TABLE and t_COLUMN also match your reserved words (SELECT and FROM). In other words, SELECT a FROM b; tokenizes to something like COLUMN COLUMN COLUMN COLUMN END (or some other ambiguous tokenization), which doesn't match any of your productions, so you get a syntax error.

As a quick sanity check, change those regular expressions so that they match exactly what you are typing in, like this:

    t_TABLE = r'b'
    t_COLUMN = r'a'

You will see that SELECT a FROM b; now parses, because the regular expressions 'a' and 'b' do not match your reserved words.

There is another problem as well: the regular expressions for TABLE and COLUMN overlap each other, so the lexer cannot tokenize those two without ambiguity either.
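To make the ambiguity concrete, here is a minimal sketch in plain Python (standard library `re` only, not PLY itself). It models a regex lexer as one master pattern in which the first matching alternative wins. The rule order in `RULES` and the helper name `token_types` are assumptions chosen for illustration; PLY has its own ordering rules, described in its documentation.

```python
import re

# Simplified model of a regex-based lexer: each token rule becomes one named
# alternative in a single master pattern, and the first alternative that
# matches at the current position wins.
# NOTE: this rule order is an assumption for illustration, not PLY's own.
RULES = [
    ("SELECT", r"select|SELECT"),
    ("COLUMN", r"[a-zA-Z]+"),
    ("TABLE", r"[a-zA-Z]+"),
    ("FROM", r"from|FROM"),
    ("END", r";"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % rule for rule in RULES))


def token_types(text):
    """Return the token type chosen for each lexeme, skipping blanks."""
    types = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t":
            pos += 1
            continue
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError("illegal character %r" % text[pos])
        types.append(match.lastgroup)
        pos = match.end()
    return types


print(token_types("SELECT a FROM b;"))
# ['SELECT', 'COLUMN', 'COLUMN', 'COLUMN', 'END']
```

With this ordering, FROM is swallowed by the COLUMN pattern and TABLE is never produced at all, so a production that expects FROM followed by TABLE can never match.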

There's a subtle but relevant section in the PLY documentation about this. I'm not sure of the best way to explain it, but the trick is that the tokenization pass happens first, so it can't use context from your production rules to know whether it has come across a TABLE token or a COLUMN token. You need to generalize those into some kind of ID token and then sort them out during the parse.

If I had a bit more energy, I would work through your code some more and provide an actual solution in code, but since you've already said that this is a learning exercise, perhaps you'll be content with me pointing you in the right direction.
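As a rough illustration of that direction only, the sketch below uses plain Python (standard library, with a small hand-written parser rather than PLY). All identifiers are matched by a single ID pattern, keywords are picked out afterwards by a dictionary lookup, and the parser decides from position whether an ID is a column or a table. The names `RESERVED`, `tokenize`, and `parse_select` are hypothetical. In PLY, the analogous approach is a `t_ID` function rule that looks up `t.value` in a reserved-words dictionary and sets `t.type`, which is the pattern the PLY documentation suggests for reserved words.

```python
import re

# Keywords are found by lookup AFTER matching a generic identifier, so the
# identifier pattern no longer competes with separate keyword patterns.
# Assumption: keywords are case-insensitive (slightly broader than the
# original 'select|SELECT' style rules).
RESERVED = {"select": "SELECT", "from": "FROM", "where": "WHERE"}
TOKEN_RE = re.compile(r"(?P<ID>[a-zA-Z]+)|(?P<STAR>\*)|(?P<COMMA>,)|(?P<END>;)")


def tokenize(text):
    """Return (type, value) pairs; identifiers are ID unless reserved."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t":
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("illegal character %r" % text[pos])
        kind, value = match.lastgroup, match.group()
        if kind == "ID":
            kind = RESERVED.get(value.lower(), "ID")
        tokens.append((kind, value))
        pos = match.end()
    return tokens


def parse_select(text):
    """Parse 'SELECT columns FROM ID END' and return (columns, table)."""
    tokens = tokenize(text)
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos][1] if pos < len(tokens) else "end of input"
            raise SyntaxError("expected %s, found %r" % (kind, found))
        value = tokens[pos][1]
        pos += 1
        return value

    expect("SELECT")
    if pos < len(tokens) and tokens[pos][0] == "STAR":
        columns = [expect("STAR")]
    else:
        # An ID in this position is a column name.
        columns = [expect("ID")]
        while pos < len(tokens) and tokens[pos][0] == "COMMA":
            expect("COMMA")
            columns.append(expect("ID"))
    expect("FROM")
    table = expect("ID")  # An ID after FROM is a table name.
    expect("END")
    if pos != len(tokens):
        raise SyntaxError("unexpected input after ';'")
    return columns, table


print(parse_select("SELECT a FROM b;"))
# (['a'], 'b')
```

The grammar still distinguishes columns from tables, but it does so by where the ID appears in a production rather than by asking the lexer to guess.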

