I am interested in learning how to write a lexer generator like flex. I have read "Compilers: Principles, Techniques, and Tools" (the "dragon book"), and I have a basic understanding of how flex works.
My initial approach is this: the user provides a hash map that maps each regular expression to a token enumeration. The program cycles through the regular expressions one at a time, in the order given, and checks whether each one matches the beginning of the line (I could add `^` to the start of each regular expression to implement this). If one matches, I add the token for that regular expression to the program's list of tokens.
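For illustration, here is a minimal sketch of that try-rules-in-order, first-match-wins approach. To keep it dependency-free I hand-roll the matchers instead of using a regex library; `TokenKind`, `tokenize`, and the individual matcher functions are all names I made up for this sketch (ASCII input assumed):

```rust
// Hypothetical token kinds for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind { KwIf, Ident, Number, Whitespace }

// A rule: a matcher reporting how many bytes it matches at the
// start of `s` (None if it does not match there at all).
type Matcher = fn(&str) -> Option<usize>;

fn match_kw_if(s: &str) -> Option<usize> {
    if s.starts_with("if") { Some(2) } else { None }
}

fn match_ident(s: &str) -> Option<usize> {
    let n = s.chars().take_while(|c| c.is_ascii_alphabetic() || *c == '_').count();
    if n > 0 { Some(n) } else { None }
}

fn match_number(s: &str) -> Option<usize> {
    let n = s.chars().take_while(|c| c.is_ascii_digit()).count();
    if n > 0 { Some(n) } else { None }
}

fn match_ws(s: &str) -> Option<usize> {
    let n = s.chars().take_while(|c| c.is_whitespace()).count();
    if n > 0 { Some(n) } else { None }
}

fn tokenize(mut input: &str) -> Vec<(TokenKind, String)> {
    // Rules are tried in the order listed; the FIRST match wins
    // (no longest-match handling yet).
    let rules: &[(Matcher, TokenKind)] = &[
        (match_ws, TokenKind::Whitespace),
        (match_kw_if, TokenKind::KwIf),
        (match_ident, TokenKind::Ident),
        (match_number, TokenKind::Number),
    ];
    let mut out = Vec::new();
    'outer: while !input.is_empty() {
        for &(m, kind) in rules {
            if let Some(len) = m(input) {
                if kind != TokenKind::Whitespace {
                    out.push((kind, input[..len].to_string()));
                }
                input = &input[len..];
                continue 'outer;
            }
        }
        panic!("no rule matched at: {:?}", input);
    }
    out
}
```

Note that on input like `ifa`, first-match-in-order semantics produce `KwIf` followed by `Ident("a")`, which is exactly the problem my second question below is about.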
My first question is: is this the most efficient way to do it? Currently I have to iterate over every regular expression, but in theory I could build a DFA from all the regular expressions combined and run the input through that instead, which should be faster. However, there would be some overhead in constructing this DFA.
My second question is: how would I implement the longest-match rule (maximal munch), like flex does? That is, I want to match `ifa` as an identifier, not as the keyword `if` followed by the letter `a`. I do not see an efficient way to do this with regexes alone. I think I would have to try all the regular expressions, and if more than one matches, take the longest result. However, if I converted the regular expressions to a DFA (that is, my own DFA data structure), then I could keep consuming characters until there are no more transition edges to follow in the DFA. At that point, I can take the last accepting state I passed through as the actual token match, since that should be the longest match.
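That maximal-munch-over-a-DFA idea can be sketched concretely. Below, a tiny hand-built DFA combines two rules (the keyword `if` and lowercase identifiers); the driver follows transitions as long as edges exist, remembers the last accepting state it passed through, and reports that as the match. All names here (`Tok`, `step`, `longest_match`) are illustrative, not from any real library:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok { KwIf, Ident }

// Hand-built DFA over the combined rules `if` and [a-z]+:
//   state 0: start
//   state 1: saw "i"             -> accepts Ident
//   state 2: saw "if"            -> accepts KwIf (keyword outranks Ident)
//   state 3: generic identifier  -> accepts Ident
fn step(state: usize, c: char) -> Option<usize> {
    match (state, c) {
        (0, 'i') => Some(1),
        (0, 'a'..='z') => Some(3),
        (1, 'f') => Some(2),
        (1, 'a'..='z') => Some(3),
        (2, 'a'..='z') => Some(3), // "if" plus more letters -> identifier
        (3, 'a'..='z') => Some(3),
        _ => None, // no transition edge
    }
}

fn accepting(state: usize) -> Option<Tok> {
    match state {
        1 | 3 => Some(Tok::Ident),
        2 => Some(Tok::KwIf),
        _ => None,
    }
}

// Maximal munch: keep following transitions while edges exist,
// remember the last accepting state passed, and emit the token for
// that state along with the byte length of the match.
fn longest_match(input: &str) -> Option<(Tok, usize)> {
    let mut state = 0;
    let mut last_accept: Option<(Tok, usize)> = None;
    for (i, c) in input.char_indices() {
        match step(state, c) {
            Some(next) => {
                state = next;
                if let Some(tok) = accepting(state) {
                    last_accept = Some((tok, i + c.len_utf8()));
                }
            }
            None => break, // dead end: fall back to the last accept
        }
    }
    last_accept
}
```

With this, `longest_match("ifa")` yields `(Ident, 3)` because the last accepting state reached covers all three characters, while `longest_match("if")` yields `(KwIf, 2)`. A real generator would of course derive the transition table automatically (Thompson construction plus subset construction) rather than writing `step` by hand.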
Does this DFA approach sound reasonable, or is there a better way to do it?
EDIT: In case it matters, I am writing this in Rust, using its regex library: http://static.rust-lang.org/doc/master/regex/index.html