I am interested in learning how to write a lexer generator like flex. I have read "Compilers: Principles, Techniques, and Tools" (the "dragon book"), and I have a basic understanding of how flex works.
My initial approach is this: the user provides a hash map that maps each regular expression to a token enumeration. The program cycles through the regular expressions one at a time, in the order given, and checks whether each one matches the beginning of the line (I could add `^` to the start of each regular expression to implement this). If one matches, I add the token for that regular expression to the program's list of tokens.
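For illustration, here is a minimal sketch of that try-rules-in-order, first-match-wins approach. To keep it dependency-free I hand-roll the matchers instead of using a regex library; `TokenKind`, `tokenize`, and the individual matcher functions are all names I made up for this sketch (ASCII input assumed):

```rust
// Hypothetical token kinds for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind { KwIf, Ident, Number, Whitespace }

// A rule: a matcher reporting how many bytes it matches at the
// start of `s` (None if it does not match there at all).
type Matcher = fn(&str) -> Option<usize>;

fn match_kw_if(s: &str) -> Option<usize> {
    if s.starts_with("if") { Some(2) } else { None }
}

fn match_ident(s: &str) -> Option<usize> {
    let n = s.chars().take_while(|c| c.is_ascii_alphabetic() || *c == '_').count();
    if n > 0 { Some(n) } else { None }
}

fn match_number(s: &str) -> Option<usize> {
    let n = s.chars().take_while(|c| c.is_ascii_digit()).count();
    if n > 0 { Some(n) } else { None }
}

fn match_ws(s: &str) -> Option<usize> {
    let n = s.chars().take_while(|c| c.is_whitespace()).count();
    if n > 0 { Some(n) } else { None }
}

fn tokenize(mut input: &str) -> Vec<(TokenKind, String)> {
    // Rules are tried in the order listed; the FIRST match wins
    // (no longest-match handling yet).
    let rules: &[(Matcher, TokenKind)] = &[
        (match_ws, TokenKind::Whitespace),
        (match_kw_if, TokenKind::KwIf),
        (match_ident, TokenKind::Ident),
        (match_number, TokenKind::Number),
    ];
    let mut out = Vec::new();
    'outer: while !input.is_empty() {
        for &(m, kind) in rules {
            if let Some(len) = m(input) {
                if kind != TokenKind::Whitespace {
                    out.push((kind, input[..len].to_string()));
                }
                input = &input[len..];
                continue 'outer;
            }
        }
        panic!("no rule matched at: {:?}", input);
    }
    out
}
```

Note that on input like `ifa`, first-match-in-order semantics produce `KwIf` followed by `Ident("a")`, which is exactly the problem my second question below is about.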
My first question is: is this the most efficient way to do it? Currently I have to iterate over every regular expression, but in theory I could build a DFA from all the regular expressions combined and run the input through that instead, which should be faster. However, there would be some overhead in constructing this DFA.
My second question is: how would I implement the longest-match rule (maximal munch), like flex does? That is, I want to match `ifa` as an identifier, not as the keyword `if` followed by the letter `a`. I do not see an efficient way to do this with regexes alone. I think I would have to try all the regular expressions, and if more than one matches, take the longest result. However, if I converted the regular expressions to a DFA (that is, my own DFA data structure), then I could keep consuming characters until there are no more transition edges to follow in the DFA. At that point, I can take the last accepting state I passed through as the actual token match, since that should be the longest match.
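That maximal-munch-over-a-DFA idea can be sketched concretely. Below, a tiny hand-built DFA combines two rules (the keyword `if` and lowercase identifiers); the driver follows transitions as long as edges exist, remembers the last accepting state it passed through, and reports that as the match. All names here (`Tok`, `step`, `longest_match`) are illustrative, not from any real library:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok { KwIf, Ident }

// Hand-built DFA over the combined rules `if` and [a-z]+:
//   state 0: start
//   state 1: saw "i"             -> accepts Ident
//   state 2: saw "if"            -> accepts KwIf (keyword outranks Ident)
//   state 3: generic identifier  -> accepts Ident
fn step(state: usize, c: char) -> Option<usize> {
    match (state, c) {
        (0, 'i') => Some(1),
        (0, 'a'..='z') => Some(3),
        (1, 'f') => Some(2),
        (1, 'a'..='z') => Some(3),
        (2, 'a'..='z') => Some(3), // "if" plus more letters -> identifier
        (3, 'a'..='z') => Some(3),
        _ => None, // no transition edge
    }
}

fn accepting(state: usize) -> Option<Tok> {
    match state {
        1 | 3 => Some(Tok::Ident),
        2 => Some(Tok::KwIf),
        _ => None,
    }
}

// Maximal munch: keep following transitions while edges exist,
// remember the last accepting state passed, and emit the token for
// that state along with the byte length of the match.
fn longest_match(input: &str) -> Option<(Tok, usize)> {
    let mut state = 0;
    let mut last_accept: Option<(Tok, usize)> = None;
    for (i, c) in input.char_indices() {
        match step(state, c) {
            Some(next) => {
                state = next;
                if let Some(tok) = accepting(state) {
                    last_accept = Some((tok, i + c.len_utf8()));
                }
            }
            None => break, // dead end: fall back to the last accept
        }
    }
    last_accept
}
```

With this, `longest_match("ifa")` yields `(Ident, 3)` because the last accepting state reached covers all three characters, while `longest_match("if")` yields `(KwIf, 2)`. A real generator would of course derive the transition table automatically (Thompson construction plus subset construction) rather than writing `step` by hand.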
Does this DFA approach sound reasonable, or is there a better way to do it?
EDIT: In case it matters, I am writing this in Rust, using its regex library: http://static.rust-lang.org/doc/master/regex/index.html