Lexical analyzer (Java) for HTML Markdown source code

I honestly don't know where to start writing a lexical analyzer from scratch. I wrote BNF grammar rules for my Markdown language (an HTML-like one) based on the rules and specifications that were provided to me, so none of them need to be added. Now I have to design and implement, from scratch, a lexical analyzer that breaks a source file in my Markdown language into tokens. Here is my BNF grammar:

Terminals:

#DOCUMENT BEGIN, #DOCUMENT END, #HEAD BEGIN, #HEAD END, #TITLE BEGIN, #TITLE END, #PARAGRAPH BEGIN, #PARAGRAPH END, #BOLD BEGIN, #BOLD END, #ITALICS BEGIN, #ITALICS END, #LIST BEGIN, #LIST END, #ITEM BEGIN, #ITEM END, #LINK BEGIN, #TEXT, #ADDRESS, #LINK END, #DEFINE BEGIN, #NAME, #VALUE, #DEFINE END, #USE BEGIN, #USE END

Please note that these terminals are not case sensitive.

Non-terminals:

    <document>        ::= #DOCUMENT BEGIN <macro-define> <head> <body> #DOCUMENT END
    <head>            ::= #HEAD BEGIN <title> #HEAD END | Δ
    <title>           ::= #TITLE BEGIN <text> #TITLE END | Δ
    <body>            ::= <inner-text> <body> | <paragraph> <body> | <bold> <body>
                        | <italics> <body> | <list> <body> | Δ
    <paragraph>       ::= #PARAGRAPH BEGIN <macro-define> <inner-paragraph> #PARAGRAPH END
    <inner-paragraph> ::= <inner-text> <inner-paragraph> | <bold> <inner-paragraph>
                        | <italics> <inner-paragraph> | <list> <inner-paragraph> | Δ
    <inner-text>      ::= <macro-use> <inner-text> | <text> <inner-text> | Δ
    <macro-define>    ::= #DEFINE BEGIN #NAME <text> #VALUE <body> #DEFINE END <macro-define> | Δ
    <macro-use>       ::= #USE BEGIN <text> #USE END | Δ
    <bold>            ::= #BOLD BEGIN <macro-define> <inner-text> #BOLD END
    <italics>         ::= #ITALICS BEGIN <macro-define> <inner-text> #ITALICS END
    <link>            ::= #LINK BEGIN #TEXT <text> #ADDRESS <text> #LINK END
    <list>            ::= #LIST BEGIN #ITEM BEGIN <macro-define> <inner-list> #ITEM END <list-items> #LIST END
    <list-items>      ::= #ITEM BEGIN <macro-define> <inner-list> #ITEM END <list-items> | Δ
    <inner-list>      ::= <bold> <inner-list> | <italics> <inner-list>
                        | <list> <inner-list> | <inner-text> <inner-list> | Δ
    <text>            ::= any plain text | Δ

We can assume that HTML characters such as "<", ">", "&", and "/" do not appear in any of the text in the source file. We can also assume that "#" appears only at the start of one of our Markdown annotations (for example, #DOCUMENT). I think it would be better to have separate Java classes for representing the token objects, such as DocumentBegin, DocumentEnd, ParagraphBegin, ParagraphEnd, etc. Any lexical error encountered (for example, #DOC BEGIN) should be reported to the console with a reasonable amount of error information, the compiler should exit after the first error, and if an error occurs, the output file should not be created.
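
Something like this sketch is what I have in mind for the token classes (the names beyond the ones listed above are just my own):

    // One small class per Markdown annotation, plus one class that
    // carries arbitrary plain text.
    abstract class Token { }

    class DocumentBegin  extends Token { }   // #DOCUMENT BEGIN
    class DocumentEnd    extends Token { }   // #DOCUMENT END
    class ParagraphBegin extends Token { }   // #PARAGRAPH BEGIN
    class ParagraphEnd   extends Token { }   // #PARAGRAPH END
    // ...and so on for the remaining terminals...

    class PlainText extends Token {          // the <text> wildcard
        final String text;
        PlainText(String text) { this.text = text; }
    }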

My problem is that I know what the lexical analyzer should do, but to be honest, I don't know where to start with the coding/implementation. If you need more explanation of what the problem is asking, just ask and I will do my best to explain. This was one part of a large project we had for my class; I could not complete this part and lost a lot of points, but now I just need to understand it, so that when we are tested on it I will not be lost.

2 answers

Well, it's pretty late, but here we go.

Lexical analysis is often associated with grammars (and BNF notation), but the two are actually slightly different things.

Lexical analyzers turn characters into tokens, which act somewhat like the processed "atoms" of the grammar, while parsers turn tokens into some kind of intermediate structure (usually a tree). Focusing on the lexical-analyzer part alone, you can think of it as low-level input processing, much as we assemble letters into words.

Since you already have a BNF grammar, you already know all the tokens (terminal words) you are going to use, so write them down in a list. The idea is to be able to decide quickly which series of letters maps to each item in the list. For example:

    #, D, E, F, I, N, E, <whitespace>        => #DEFINE
    #, D, O, C, U, M, E, N, T, <whitespace>  => #DOCUMENT
    B, E, G, I, N, <whitespace>              => BEGIN
    E, N, D, <whitespace>                    => END
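
In Java, one simple way to hold that list is a pair of lookup sets, normalized to upper case since the terminals are case-insensitive (a sketch of mine; the contents come from the terminals above):

    import java.util.Set;

    class Keywords {
        // Words that may follow '#'; lookup uses the upper-cased spelling.
        static final Set<String> ANNOTATIONS = Set.of(
            "DOCUMENT", "HEAD", "TITLE", "PARAGRAPH", "BOLD", "ITALICS",
            "LIST", "ITEM", "LINK", "TEXT", "ADDRESS",
            "DEFINE", "NAME", "VALUE", "USE");

        // Words that complete a compound terminal such as #DOCUMENT BEGIN.
        static final Set<String> MODIFIERS = Set.of("BEGIN", "END");

        static boolean isAnnotation(String word) {
            return ANNOTATIONS.contains(word.toUpperCase());
        }
    }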

There are several issues that come up in this kind of matching:

First, you have a lot of comparisons to do. The first character read may be "#", and if it is, you still have more than 20 items that might match. That means you have to continue matching with the next character which, if it were "D", would still leave two possible matches: "#DEFINE" and "#DOCUMENT".

Second, if you have overlapping words like "BEGIN" and "BEGINNING", you cannot decide between them after matching "BEGIN" until you capture the next character. In a system that assumes a captured character is "consumed", capturing that extra character complicates the processing of the next token. Peeking or lookahead solves this, but it adds complexity to the logic that decides which tokens to generate.
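
In Java, one character of lookahead is easy to get with java.io.PushbackReader, which lets you "un-read" a character you only wanted to peek at. A small sketch:

    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.StringReader;

    class PeekDemo {
        public static void main(String[] args) throws IOException {
            PushbackReader in = new PushbackReader(new StringReader("#BOLD BEGIN"));

            int first  = in.read();   // consume '#'
            int peeked = in.read();   // look at the next character...
            in.unread(peeked);        // ...then push it back, unconsumed

            // The pushed-back character is returned by the next read().
            System.out.println((char) first + " then " + (char) in.read());
        }
    }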

Third, you have a wildcard text token. This token can match almost anything, so you need to check it against all your other tokens to be sure your token-generation logic will always know which token it should generate.
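
Given the guarantee above that "#" appears only in front of an annotation, one way to scan that wildcard token is to accumulate plain text until a "#" or the end of input shows up (again, only a sketch):

    import java.io.IOException;
    import java.io.PushbackReader;

    class TextScanner {
        // Reads plain text up to (but not including) the next '#' or end of input.
        static String readPlainText(PushbackReader in) throws IOException {
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (c == '#') {       // start of an annotation: push it back
                    in.unread(c);     // for the annotation matcher and stop
                    break;
                }
                text.append((char) c);
            }
            return text.toString();
        }
    }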

Ideally, the token generator (lexer) does not depend on any parsing to "know" the next token; however, some languages are complex enough that the parser has to give "hints" to the lexer. Avoiding systems like that makes for cleaner compiler implementations; unfortunately, with some pre-existing languages it is not always possible to build them that way.

So, now that you have an idea of what needs to be done (which you probably already had in some sense), how do you do it?

Well, you need some kind of index to keep track of the characters you have consumed (those that have been fully converted into tokens), so that you don't accidentally let one character affect the token stream twice. You will need a second pointer for lookahead if you want lookahead, and you will likely want to limit the amount of lookahead (to keep the logic manageable).
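
If the whole source is in memory, the two pointers can simply be integer indexes into the string; something like this (the names are mine):

    class Cursor {
        private final String src;  // the whole source file
        private int pos = 0;       // index of the next unconsumed character

        Cursor(String src) { this.src = src; }

        // Look ahead k characters without consuming anything; -1 means end of input.
        int peek(int k) {
            int i = pos + k;
            return i < src.length() ? src.charAt(i) : -1;
        }

        // Consume and return the next character, advancing the index.
        int next() {
            return pos < src.length() ? src.charAt(pos++) : -1;
        }
    }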

Then you need an unknown number of data structures, called tokens. While it is not always necessary, I recommend keeping track of the starting line number, starting character index, ending line number, and ending character index in each token. It makes debugging a lot easier. In addition, it is nice to "capture" the matched substring in the token. You can call it what you want, but some people call it the "image" of the token.
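
In Java, such a token record might look like the following (the field names are my own, and the enum is one alternative to the class-per-token design mentioned in the question):

    class Token {
        final TokenType type;              // the kind of token
        final String image;                // the exact substring that was matched
        final int beginLine, beginColumn;  // where the token starts
        final int endLine, endColumn;      // where the token ends

        Token(TokenType type, String image,
              int beginLine, int beginColumn, int endLine, int endColumn) {
            this.type = type;
            this.image = image;
            this.beginLine = beginLine;
            this.beginColumn = beginColumn;
            this.endLine = endLine;
            this.endColumn = endColumn;
        }

        @Override public String toString() {
            return type + " \"" + image + "\" at " + beginLine + ":" + beginColumn;
        }
    }

    enum TokenType { DOCUMENT_BEGIN, DOCUMENT_END, PLAIN_TEXT /* ... */ }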

Naturally, if your parser distinguishes between different kinds of tokens, then you should store the token's type in (or with) the token in some way. Occasionally one also has a notion of the "value" of a token, and that can be stored too.

After some effort, you can push a stream of characters into the lexer and have a stream of tokens come out. Good luck.
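
To give an idea of the final shape, here is a hypothetical driver; Lexer below is only a stub standing in for the real lexer built from the pieces above, and LexicalException is an invented name for the error it reports:

    import java.util.Iterator;
    import java.util.List;

    class LexicalException extends RuntimeException {
        LexicalException(String msg) { super(msg); }
    }

    class Lexer {
        private final Iterator<String> tokens;  // stub: canned tokens
        Lexer(String source) {
            // A real lexer would scan `source`; this stub fakes two tokens.
            tokens = List.of("DOCUMENT_BEGIN", "PLAIN_TEXT \"Hello\"").iterator();
        }
        String nextToken() {                     // null signals end of input
            return tokens.hasNext() ? tokens.next() : null;
        }
    }

    public class Main {
        public static void main(String[] args) {
            Lexer lexer = new Lexer("#DOCUMENT BEGIN Hello #DOCUMENT END");
            try {
                String t;
                while ((t = lexer.nextToken()) != null) {
                    System.out.println(t);
                }
            } catch (LexicalException e) {
                System.err.println(e.getMessage());  // report the first error...
                System.exit(1);                      // ...and create no output file
            }
        }
    }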


The best (well, the only one I know of) lexical analyzer generator I have found for this in Java is called JFlex. We used it at university to tokenize source code, and I have used it commercially to build syntax highlighting for domain-specific languages in applications.

JFlex Lexical Analyzer

http://jflex.de/

CUP parser generator

http://www2.cs.tum.edu/projects/cup/

A bit about LALR(1) parsers

http://en.wikipedia.org/wiki/LALR_parser

If you need examples (e.g., sample code), write to me and I will send you some notes. A quick Google search didn't turn up anything too useful, although I'm sure some university sites (like Princeton's) may have something.

Cheers,

John


Source: https://habr.com/ru/post/1446539/

