This assumes you have a working regex_match function that returns a boolean (True if the string matches the regular expression). First, you need an ordered list tokens_regex of tokens (with a regular expression for each); the order is important, since it determines the priority.
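For example, regex_match might be a thin wrapper around Python's re module, and tokens_regex a small table of token names and patterns (a minimal sketch; this concrete token set is made up for illustration):

    import re

    def regex_match(string, regex):
        # True only if the whole string matches the regular expression
        return re.fullmatch(regex, string) is not None

    # Order matters: KEYWORD comes before NAME, so "if" is lexed as a
    # keyword even though it also matches the NAME pattern.
    tokens_regex = [
        ("KEYWORD", r"if|else|while"),
        ("NAME",    r"[A-Za-z_][A-Za-z_0-9]*"),
        ("NUMBER",  r"[0-9]+"),
        ("OP",      r"[+\-*/=]"),
    ]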
One possible algorithm (not necessarily the only one):

- Write a next_token procedure that takes the string and returns the first token, its value, and the remaining string (or, for an illegal or ignored character: None, the offending character, and the remaining string). Note: this must respect the priority order and must find the longest matching token.
- Write a lex procedure that calls next_token repeatedly until the string is exhausted.
Something like this (written in Python):
    # tokens_regex = [(TOKEN_NAME, TOKEN_REGEX), ...]  (order defines the precedence)

    def next_token(remaining_string):
        for t_name, t_regex in tokens_regex:  # check in order of precedence
            # check the longest possibilities first (there may be a more efficient method)
            for i in range(len(remaining_string), 0, -1):
                if regex_match(remaining_string[:i], t_regex):
                    return t_name, remaining_string[:i], remaining_string[i:]
        return None, remaining_string[0], remaining_string[1:]  # either an ignored or illegal character

    def lex(string):
        tokens_so_far = []
        remaining_string = string
        while len(remaining_string) > 0:
            t_name, t_value, remaining_string = next_token(remaining_string)
            if t_name is not None:
                tokens_so_far.append((t_name, t_value))
            # elif not regex_match(t_value, ignore_regex):
            #     check against an ignore regex; if it does not match, add it to an error list of illegal characters
        return tokens_so_far
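With the hypothetical regex_match and tokens_regex sketched above, a quick run might look like this (spaces match no token, so they come back with a None token name and are silently skipped):

    print(lex("if x1 = 42"))
    # [('KEYWORD', 'if'), ('NAME', 'x1'), ('OP', '='), ('NUMBER', '42')]

Note that because the outer loop checks priority before length, an input like ifx is split into KEYWORD if followed by NAME x; a maximal-munch lexer would instead prefer the longest overall match (NAME ifx) and break ties by priority.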
Some things you can add to improve your lexer: an ignore regular expression, error lists, and locations / line numbers (for those errors, or for the tokens).
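A minimal sketch of those improvements, built on the lex above (ignore_regex, lex_with_errors, and the whitespace pattern are assumptions for illustration, not part of the original code): track a line counter while lexing, skip characters that match an ignore pattern, and collect the rest as errors.

    ignore_regex = r"[ \t\n]"  # whitespace is skipped rather than reported

    def lex_with_errors(string):
        tokens, errors = [], []
        line = 1
        remaining_string = string
        while len(remaining_string) > 0:
            t_name, t_value, remaining_string = next_token(remaining_string)
            if t_name is not None:
                tokens.append((t_name, t_value, line))  # tag each token with its line
            elif not regex_match(t_value, ignore_regex):
                errors.append((t_value, line))  # illegal character and where it was found
            line += t_value.count("\n")  # keep the line counter up to date
        return tokens, errors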
Good luck with the lexer, and good luck creating the parser :).