How can I implement a lexer, given that I have already implemented basic regex matching?

I am trying to implement a lexer for fun. I have already implemented basic regex matching (by first converting the pattern to an NFA and then to a DFA). Now I am not sure how to proceed.

My lexer will take a list of tokens and their corresponding regular expressions. What is the general algorithm used to create a lexer?

I was thinking about OR-ing all the regexes together into one big regex, but then I cannot determine which specific token was matched. Even if I extend my regex module to return which pattern matched when the match succeeds, how do I implement lookahead in the matching?

+4
2 answers

Assume you have a working regex matcher regex_match that returns a boolean (True if the string matches the regular expression). First, you need an ordered list of tokens tokens_regex (with a regular expression for each); the order is important, as it determines precedence.

One possible algorithm (not necessarily the only one):

  • Write a procedure next_token that takes a string and returns the first token, its value, and the remaining string (or, if it hits an illegal/ignore character, None, the offending character, and the remaining string). Note: this must respect precedence and must find the longest token.
  • Write a procedure lex that calls next_token repeatedly until the input is consumed.


Something like this (written in Python):

    tokens_regex = [(TOKEN_NAME, TOKEN_REGEX), ...]  # order describes precedence

    def next_token(remaining_string):
        for t_name, t_regex in tokens_regex:  # check in order of precedence
            # check the longest possibilities first (there may be a more efficient method)
            for i in range(len(remaining_string), 0, -1):
                if regex_match(remaining_string[:i], t_regex):
                    return t_name, remaining_string[:i], remaining_string[i:]
        return None, remaining_string[0], remaining_string[1:]  # either an ignore or illegal character

    def lex(string):
        tokens_so_far = []
        remaining_string = string
        while len(remaining_string) > 0:
            t_name, t_value, remaining_string = next_token(remaining_string)
            if t_name is not None:
                tokens_so_far.append((t_name, t_value))
            # elif not regex_match(t_value, ignore_regex):
            #     the character is neither a token nor ignorable: add it to an error list of illegal characters
        return tokens_so_far
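For example, with a couple of hypothetical token definitions (the token names and the exact regex syntax here are illustrative and depend on what your own regex_match supports), lexing could look like this:

    # Hypothetical token table; names and regex syntax are assumptions, not part of the answer above.
    tokens_regex = [
        ("NUMBER", "[0-9]+"),        # listed first, so it has the highest precedence
        ("IDENTIFIER", "[a-z]+"),
        ("PLUS", "\\+"),
    ]

    print(lex("abc+42"))
    # expected output: [('IDENTIFIER', 'abc'), ('PLUS', '+'), ('NUMBER', '42')]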

Some things you will want to add to improve your lexer: an ignore regular expression, error lists, and locations/line numbers (for those errors, or for the tokens).
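As a rough sketch of those additions (not part of the answer's code): a hypothetical variant of lex that reuses the next_token and regex_match above, assumes an ignore_regex for whitespace, tracks line numbers, and collects errors might look like this:

    ignore_regex = "[ \\t\\n]"  # hypothetical: characters to skip silently

    def lex_with_positions(string):
        tokens, errors = [], []
        line = 1
        remaining = string
        while remaining:
            t_name, t_value, remaining = next_token(remaining)
            if t_name is not None:
                tokens.append((t_name, t_value, line))
            elif not regex_match(t_value, ignore_regex):
                errors.append((t_value, line))  # illegal character and the line it was found on
            line += t_value.count("\n")  # advance the line counter past any newlines consumed
        return tokens, errors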

Good luck, and good luck creating the parser :).

+5

I did almost exactly the same thing. The way I did it was to combine all the expressions into one fairly large NFA and convert that into a single DFA. While doing so, keep track of which states were accepting states in each respective original NFA, and their precedence.

The generated DFA will have many accepting states. You run this DFA until it receives a character for which it has no matching transition. If the DFA is then in an accepting state, you look at which of your original NFAs had that accepting state in them. The one with the highest precedence is the token you return.
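A minimal sketch of that idea, assuming the combined DFA has already been built as a plain transition table (the function and variable names here are illustrative, not the answer's actual code). One extra detail in this sketch: it remembers the last accepting position it passed, so the end of the longest matched token can be recovered when the DFA gets stuck.

    # transitions: dict mapping (state, character) -> next state in the combined DFA
    # accepting:   dict mapping accepting state -> token name of its highest-precedence original NFA
    def scan_token(transitions, accepting, start_state, text, pos):
        state = start_state
        last_accept = None  # (token_name, end_position) of the longest accepted prefix seen so far
        i = pos
        while i < len(text) and (state, text[i]) in transitions:
            state = transitions[(state, text[i])]
            i += 1
            if state in accepting:
                last_accept = (accepting[state], i)
        if last_accept is None:
            return None, pos + 1  # no token matched here: report or skip the offending character
        token_name, end = last_accept
        return (token_name, text[pos:end]), end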

This does not handle the more advanced regular-expression features. In any case, they are usually not needed for a lexer to work; that kind of thing is the parser's job.

Such a lexer runs at about the same speed as a single regular expression, since there is effectively only one DFA for it. You can also omit the NFA-to-DFA conversion altogether, which makes the lexer faster to build but slower to run. The algorithm is otherwise the same.
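If you do skip the subset construction, the same scanning loop can run the NFA directly by tracking a set of current states instead of a single state. A hypothetical step function (epsilon transitions omitted for brevity) might look like:

    def nfa_step(nfa_transitions, states, char):
        # nfa_transitions: dict mapping (state, character) -> set of next states
        next_states = set()
        for s in states:
            next_states |= nfa_transitions.get((s, char), set())
        return next_states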

The source code for the lexer I wrote is freely available on GitHub, in case you want to see how I did it.

+2

Source: https://habr.com/ru/post/1432121/

