Well, it's pretty late, but here we go.
The lexical analyzer is often associated with grammars (and BNF notation), but the two are actually slightly different things.
Lexical analyzers turn characters into tokens, and those tokens are the "atoms" of the grammar, while parsers turn tokens into some kind of intermediate structure (usually a tree). Focusing only on the lexical-analyzer part, you can think of it as low-level input processing, much the same way that letters are assembled into words.
Since you already have a BNF grammar, you already know all the tokens (terminal symbols) you are going to use, so put them in a list. The idea is to decide, as quickly as possible, which series of characters maps to each item in the list. For instance:
    #, D, E, F, I, N, E, <whitespace>       => #DEFINE
    #, D, O, C, U, M, E, N, T, <whitespace> => #DOCUMENT
    B, E, G, I, N, <whitespace>             => BEGIN
    E, N, D, <whitespace>                   => END
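In code, that list can start out as a simple table of fixed lexemes. Here is a minimal sketch in Python; the table and the token names on the right are purely illustrative:

    # Hypothetical table mapping each fixed lexeme to a token name.
    KEYWORD_TOKENS = {
        "#DEFINE":   "DEFINE",
        "#DOCUMENT": "DOCUMENT",
        "BEGIN":     "BEGIN",
        "END":       "END",
    }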
There are several issues that come up with this matching:
First, you have a lot of comparisons. The first character read might be "#", and if so, you still have more than 20 items that could match. That means you have to continue the match with the next character, which, if it were "D", would still leave two possible matches: "#DEFINE" and "#DOCUMENT".
Second, if you have words like "#BEGIN" and "#BEGINNING", then once you have processed "#BEGIN" you cannot decide between them until you capture the next character. Capturing the next character, in a system that assumes that capturing a character also "consumes" it, complicates the processing of the following token. You may have to peek or look ahead, but that adds complexity to the logic that decides which tokens to generate.
Third, you have a token with wild-card behavior (typically an identifier). Such a token can match almost anything, so you need to check it against all your other tokens to make sure your token-generation logic will always know which token it should generate.
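One way to handle all three issues at once is a "longest match" (maximal munch) rule: read the whole word first, then decide. Below is a minimal Python sketch that assumes whitespace-delimited tokens like the ones above; the keyword set and the IDENT fallback name are my own inventions:

    KEYWORDS = {"#DEFINE", "#DOCUMENT", "#BEGIN", "#BEGINNING", "BEGIN", "END"}

    def next_token(text, pos):
        """Return (kind, lexeme, new_pos) using longest-match.

        Scanning to the end of the word before deciding resolves both
        shared prefixes (#DEFINE vs #DOCUMENT) and contained keywords
        (#BEGIN inside #BEGINNING); anything that is not a keyword
        falls through to the wild-card token.
        """
        end = pos
        while end < len(text) and not text[end].isspace():
            end += 1                      # scan up to the next whitespace
        lexeme = text[pos:end]
        if lexeme in KEYWORDS:
            return (lexeme, lexeme, end)  # a fixed keyword token
        return ("IDENT", lexeme, end)     # wild-card fallback

Note that the price of longest match is exactly the lookahead problem described above: you have to read past the end of a candidate like "#BEGIN" before you can commit to it.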
Ideally, the token generator (the lexer) should not depend on any parsing to "know" the next token; however, there are languages complex enough that the parser has to give "hints" to the lexer. Avoid systems like that, since they make for more complex compiler implementations; unfortunately, with some existing languages it is not always possible to avoid building them.
So, now that you have an idea of what to do (which you probably already had, in a sense), how do you do it?
Well, you need some kind of index to track the characters you have consumed (the ones fully converted into tokens), so you don't accidentally let a character have a double effect on the token stream. You will need a second pointer if you want to look ahead, and you will likely want to limit the amount of lookahead (to keep the logic less difficult).
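Those two pointers might look something like this; the class and method names are my own, and I assume the whole input fits in one string:

    class CharReader:
        """Tracks a consume index plus bounded lookahead over a string."""

        def __init__(self, text, max_lookahead=1):
            self.text = text
            self.pos = 0                  # everything before pos is consumed
            self.max_lookahead = max_lookahead

        def peek(self, offset=0):
            """Look ahead without consuming; bounded to keep the logic simple."""
            assert offset < self.max_lookahead
            i = self.pos + offset
            return self.text[i] if i < len(self.text) else None

        def advance(self):
            """Consume one character so it can never affect the stream twice."""
            ch = self.peek()
            if ch is not None:
                self.pos += 1
            return ch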
Then you need some as-yet-unknown number of data structures, called tokens. While it is not always necessary, I recommend keeping track of the starting line number, the starting character index, the ending line number, and the ending character index in each token. It makes debugging a lot easier. In addition, it is nice to "grab" the matched substring and store it in the token. You can call it what you want, but some people call it the "image" of the token.
Naturally, if your parser can distinguish between different types of tokens, then you should store the type of each token in (or with) the token in some way. Occasionally, one also has a concept of a token's "value" (for instance, the number that a numeric literal stands for), and it can be stored as well.
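Putting the bookkeeping from the last two paragraphs together, a token record might look like this (the field names are my own choices):

    from dataclasses import dataclass

    @dataclass
    class Token:
        kind: str             # token type, e.g. "DEFINE" or "IDENT"
        image: str            # the substring of the source it covers
        start_line: int
        start_col: int
        end_line: int
        end_col: int
        value: object = None  # optional semantic value, if any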
After some effort, you can push a string of characters into the lexer and have a stream of tokens come out, roughly like this:
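(A rough sketch rather than a finished lexer, reusing the hypothetical next_token helper from above.)

    def tokenize(text):
        pos, tokens = 0, []
        while pos < len(text):
            if text[pos].isspace():
                pos += 1                 # skip whitespace between tokens
                continue
            kind, image, pos = next_token(text, pos)
            tokens.append((kind, image))
        return tokens

    print(tokenize("#DEFINE BEGIN x END"))
    # [('#DEFINE', '#DEFINE'), ('BEGIN', 'BEGIN'), ('IDENT', 'x'), ('END', 'END')]

Good luck.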