The issue of tokenizer efficiency

Question

I am writing the front end of a compiler for a project, and I'm trying to figure out the best way to tokenize source code. I cannot choose between two approaches:

1) The tokenizer reads all the tokens up front:

    bool Parser::ReadAllTokens()
    {
        Token token;
        while( m_Lexer->ReadToken( &token ) )
        {
            m_Tokens->push_back( token );
            token.Reset(); // reset the token values
        }
        return !m_Tokens->empty();
    }

and then the parsing phase begins, working on the m_Tokens list. This way, the getNextToken(), peekNextToken(), and ungetToken() methods are relatively easy to implement with an iterator, and the parsing code is well written and clear (not broken up by getNextToken(), i.e.:

    getNextToken();
    useToken();
    getNextToken();
    peekNextToken();
    if( peeked is something )
        ungetToken();
    ..
    ..

)

2) The parsing phase begins right away, and a token is created and consumed only when it is needed (the code seems less clear).

Which method is best, and why? What about efficiency? Thanks in advance for any replies.

+4

c++ compiler-construction tokenize parsing

Salv0 Jan 19 '11 at 13:03


4 answers

Better to use something like Boost::Spirit for tokenizing. Why reinvent the wheel?

+2

T33c Jan 19 '11 at 13:09


Your method (1) is usually overkill: there is no need to tokenize the entire file before parsing it.

A good approach is to implement a buffered tokenizer that keeps a list of the tokens that were peeked or ungot. A "get" consumes an element of this list, or reads a new token from the file when the list is empty (à la FILE*).

+2

Noe Jan 19 '11 at 13:13


The first method is better, since you will still be able to understand the code three months from now...

+1

swegi Jan 19 '11 at 13:12


Jörgen sigvardsson · Accepted Answer · 2011-01-19T13:11:37+0000

Traditionally, compiler construction classes teach you to read tokens one by one as you parse. The reason is that back in the day, memory was scarce: you had kilobytes at your disposal, not gigabytes as you do today.

Having said that, I don't mean to recommend that you read all the tokens in advance and then parse from your list of tokens. The input is of arbitrary size, and if you hog too much memory, the system will become slow. Since it looks like you only need one token of lookahead, I would read one token at a time from the input stream. The operating system will buffer and cache the input stream for you, so it will be fast enough for most purposes.
