Direct-coded vs. table-driven lexer?

I am new to the world of compiler construction, and I want to know: what are the differences between a direct-coded lexer and a table-driven lexer?

Please use a simple source code example if possible.

Thanks.

Edit:

in "Compiler Developer," the author divided the lexers into three (3) types: table-driven, direct-encoded, and manual-encoded.


I assume that by "direct coding" you mean a handwritten lexer rather than one produced as the output of a lexer generator. In that case, a table-driven lexer is a (usually automatically generated) program that uses some sort of lookup table to determine which action to take. Consider the finite state machine that matches the regular expression ab*a (intentionally not minimized):

[Figure: DFA for ab*a]

If we limit ourselves to the characters "a" and "b", we can encode the DFA in a lookup table as follows:

    #define REJECT -1

    /* This table encodes the transition for each (state, character) pair. */
    const int transitions[][2] = {
        /* In state 0, if we see an 'a', go to state 1 (the 1).
         * Otherwise, reject the input. */
        { /*a*/ 1,      /*b*/ REJECT },
        { /*a*/ 2,      /*b*/ 3      },
        { /*a*/ REJECT, /*b*/ REJECT }, /* Accepting state; these entries are never consulted. */
        { /*a*/ 2,      /*b*/ 3      }
    };

    /* This table determines, for each state, whether it is an accepting state. */
    const int accept[] = { 0, 0, 1, 0 };

Now we only need a driver that actually uses the table:

    #include <stdio.h>

    int scan(void)
    {
        int state = 0;

        while (!accept[state]) {
            int ch = getchar() - 'a'; /* Adjust so that a => 0, b => 1. */

            /* Characters outside the alphabet are rejected as well. */
            if (ch < 0 || ch > 1 || transitions[state][ch] == REJECT) {
                fprintf(stderr, "invalid token!\n");
                return 0; /* Failure. */
            }
            state = transitions[state][ch];
        }

        return 1; /* Success! */
    }
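
For completeness, a minimal usage sketch (this main function is my own illustration, not part of the original answer):

    /* Hypothetical driver: try to match one ab*a token from stdin. */
    int main(void)
    {
        if (scan())
            puts("matched ab*a");
        else
            puts("no match");
        return 0;
    }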

Of course, in a real lexer each token would have a corresponding accepting state, and you would have to modify the accept table so that it maps each accepting state to a token identifier (a sketch of this follows the list below). I want to emphasize two things:

  • A table-driven lexer does not necessarily work at the DFA state level.
  • I do not recommend manually writing table-driven lexers, as this is a tedious and error-prone process.
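
As a minimal sketch of that change to the accept table (the token names and the NO_TOKEN sentinel are my own illustration, not part of the original answer):

    enum token { NO_TOKEN, ABSTAR_A /* The token matched by ab*a. */ };

    /* Instead of a 0/1 flag, each state now maps to the token it accepts,
     * or NO_TOKEN if it is not an accepting state. */
    const enum token accept[] = { NO_TOKEN, NO_TOKEN, ABSTAR_A, NO_TOKEN };

Because NO_TOKEN is 0, the driver's while (!accept[state]) loop keeps working unchanged; on success, scan() can return accept[state] instead of 1.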

A handwritten (for lack of a better name) lexer usually does not use a lookup table. Suppose we need a lexer for a simple Lisp-like language that has parentheses, identifiers, and decimal integers:

    #include <ctype.h>
    #include <stdio.h>

    enum token { ERROR, LPAREN, RPAREN, IDENT, NUMBER };

    static char first_nonblank(void);
    static enum token ident(void);
    static enum token number(void);

    enum token scan(void)
    {
        /* Consume all leading whitespace. */
        char ch = first_nonblank();

        if (ch == '(') return LPAREN;
        else if (ch == ')') return RPAREN;
        else if (isalpha(ch)) return ident();
        else if (isdigit(ch)) return number();
        else {
            printf("invalid token!\n");
            return ERROR;
        }
    }

    static char first_nonblank(void)
    {
        char ch;

        do
            ch = getchar();
        while (isspace(ch));

        return ch;
    }

    static enum token ident(void)
    {
        char ch;

        do
            ch = getchar();
        while (isalpha(ch));

        ungetc(ch, stdin); /* Put back the first non-alphabetic character. */
        return IDENT;
    }

    static enum token number(void)
    {
        char ch;

        do
            ch = getchar();
        while (isdigit(ch));

        ungetc(ch, stdin); /* Put back the first non-digit. */
        return NUMBER;
    }

Like the table-driven lexer example, this is not complete. First, it needs some buffering to store the characters of tokens such as IDENT and NUMBER. Second, it does not handle EOF particularly gracefully. But I hope you get the gist of it.
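
For illustration, one way to add that buffering to ident() (the lexeme buffer and the MAX_TOKEN limit are assumptions of mine, not part of the original answer); scan() would then call ident(ch), so the character already consumed by first_nonblank() is not lost:

    #define MAX_TOKEN 64

    static char lexeme[MAX_TOKEN + 1]; /* Text of the most recent token. */

    /* Buffering variant: the caller passes in the character that
     * first_nonblank() already consumed. */
    static enum token ident(char first)
    {
        int ch, len = 0;

        lexeme[len++] = first;
        while (isalpha(ch = getchar()) && len < MAX_TOKEN)
            lexeme[len++] = (char)ch;
        lexeme[len] = '\0';

        ungetc(ch, stdin); /* Put back the first non-alphabetic character. */
        return IDENT;
    }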


Edit: By the definition in Engineering a Compiler, a direct-coded lexer is basically a hybrid of the two. Instead of using a table, it uses code labels to represent states. Let's see how that looks using the same DFA as before:

    #include <stdio.h>

    /* Report an invalid token. */
    static void error(void)
    {
        fprintf(stderr, "invalid token!\n");
    }

    int scan(void)
    {
        int ch;

    state0:
        ch = getchar();
        if (ch == 'a') goto state1;
        else { error(); return 0; }

    state1:
        ch = getchar();
        if (ch == 'a') goto state2;
        else if (ch == 'b') goto state3;
        else { error(); return 0; }

    state2:
        return 1; /* Accept! */

    state3:
        ch = getchar();
        if (ch == 'a') goto state2;
        else if (ch == 'b') goto state3; /* Loop on b. */
        else { error(); return 0; }
    }

In my personal experience, the "best" approach to writing lexers is the handwritten approach described above. It works on every platform and in every language, you don't need to learn ancient tools such as lex, and, perhaps most importantly, you have the flexibility to extend the lexer with features that are difficult or impossible to implement with a tool. For example, perhaps you want your lexer to understand Python-esque indentation blocks, or perhaps you need to dynamically extend certain token classes (a sketch of the indentation idea follows below).
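
To make the indentation example concrete, here is a minimal sketch of the usual indent-stack technique; everything in it (the names, the fixed stack size, spaces-only indentation) is my own illustration, not part of the original answer. The idea is to measure the leading whitespace of each line and emit synthetic INDENT/DEDENT tokens by comparing it against a stack of enclosing indentation widths:

    #include <stdio.h>

    enum token { INDENT, DEDENT /* ...plus the regular tokens... */ };

    static int indents[64]; /* Stack of indentation widths; indents[0] is 0. */
    static int top = 0;

    /* Call at the start of each line: measure the leading spaces and
     * write INDENT/DEDENT tokens into out[]. Returns how many were written. */
    static int measure_indent(FILE *in, enum token out[])
    {
        int width = 0, n = 0, ch;

        while ((ch = getc(in)) == ' ')
            width++;
        ungetc(ch, in); /* Put back the first non-space character. */

        if (width > indents[top]) {
            indents[++top] = width; /* Deeper: emit one INDENT. */
            out[n++] = INDENT;
        } else {
            while (width < indents[top]) {
                top--;              /* Shallower: one DEDENT per closed level. */
                out[n++] = DEDENT;
            }
        }
        return n;
    }

A real implementation would also handle tabs, blank lines, and indentation that matches no enclosing level, but the stack is the essential trick, and it is awkward to express in a fixed DFA table.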


Source: https://habr.com/ru/post/980484/

