Is it possible in ocamllex to define a rule that looks ahead at the next character without consuming it?

I am using ocamllex to write a lexer for a scripting language, but I have run into a conflict with my comment rule.

I want to allow command arguments to be unquoted as long as they contain only alphanumeric characters and slashes ("/"). For instance:

    echo "quoted argument !@ #%" /this/second/argument/is/unquoted

Also, one of my requirements is support for C++-style comments introduced by "//":

    //this is a comment
    echo hello world

The problem arises with input such as:

    echo foo//comment

I would like my lexer to produce a token for "foo" while leaving the "//" untouched, so that it is consumed the next time I ask the lexer for a token. Is that possible? The reason is that the input buffer may not yet have reached the end of the comment, and I would rather return the "foo" token immediately than block unnecessarily while eagerly trying to consume the comment.

1 answer

Below is a small lexer that recognizes only echo, quoted and unquoted arguments, and comments, and prints the resulting tokens:

    {
      type token =
        | NEWLINE
        | ECHO
        | QUOTED of string
        | UNQUOTED of string
        | COMMENT of string

      exception Eof

      (* Entry point to use on the next call to the lexer. *)
      type state = CODE | LINE_COMMENT
      let state = ref CODE
    }

    let newline = '\n'
    let alphanum = [ 'A'-'Z' 'a'-'z' '0'-'9' '_' ]
    let comment_line = "//"([^ '\n' ]+)
    let space = [ ' ' '\t' ]
    let quoted = '"'([^ '"' ]+)'"'
    let unquoted = ('/'?(alphanum+'/'?)+)

    rule code = parse
        space+            { code lexbuf }
      | newline           { code lexbuf }
      | "echo"            { ECHO }
      | quoted            { QUOTED (Lexing.lexeme lexbuf) }
      | "//"              { line_comment "" lexbuf }
      | ('/'|alphanum+)   { unquoted (Lexing.lexeme lexbuf) lexbuf }
      | eof               { raise Eof }

    (* Accumulates an unquoted argument piece by piece so that "//" can end it. *)
    and unquoted buff = parse
        newline           { UNQUOTED buff }
      | "//"              { (* the "//" is consumed here; state remembers it *)
                            state := LINE_COMMENT;
                            if buff = "" then line_comment "" lexbuf
                            else UNQUOTED buff }
      | ('/'|alphanum+)   { unquoted (buff ^ Lexing.lexeme lexbuf) lexbuf }
      | space+            { UNQUOTED buff }
      | eof               { (* emit the pending argument; code raises Eof
                               on the next call *)
                            UNQUOTED buff }

    and line_comment buff = parse
        newline           { state := CODE; COMMENT buff }
      | eof               { state := CODE; COMMENT buff }
      | _                 { line_comment (buff ^ Lexing.lexeme lexbuf) lexbuf }

    {
      (* Dispatch to the rule recorded by the previous call. *)
      let lexer lb =
        match !state with
        | CODE -> code lb
        | LINE_COMMENT -> line_comment "" lb

      let () =
        try
          let lexbuf = Lexing.from_channel stdin in
          while true do
            (match lexer lexbuf with
             | ECHO -> Printf.printf "ECHO\n"
             | QUOTED s -> Printf.printf "QUOTED(%s)\n" s
             | UNQUOTED s -> Printf.printf "UNQUOTED(%s)\n" s
             | COMMENT s -> Printf.printf "COMMENT(%s)\n" s
             | NEWLINE -> Printf.printf "\n");
            flush stdout
          done
        with Eof -> exit 0
    }

This is the trick I used in a project of mine to overcome the same limitation in ocamllex (compared with the original C lex program, which allows a pattern to match with lookahead). Basically, it splits the ambiguous rules into their distinct parts and switches the lexer to a different rule accordingly. It also keeps track of the rule currently in use and of the next entry point.

In your situation, the only states you need to track are the default one (CODE) and comment mode (LINE_COMMENT). This can be extended to support other states if necessary.
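To see the state-switching idea in isolation, here is a minimal sketch in plain OCaml, without ocamllex. Everything in it (the `lexer` record, `make`, `next_token`, `tokens_of_string`) is a hypothetical illustration, and it handles only unquoted words and "//" comments, with no `echo` keyword or quoted strings. A hand-written scanner could simply peek at the input; the sketch deliberately consumes the "//" and records that fact in `mode`, mirroring what the ocamllex rules above do with the CODE and LINE_COMMENT states.

```ocaml
(* Hypothetical illustration of the mode-switching trick; not generated by ocamllex. *)
type token = UNQUOTED of string | COMMENT of string

type mode = CODE | LINE_COMMENT

type lexer = {
  input : string;
  mutable pos : int;    (* index of the next unread character *)
  mutable mode : mode;  (* entry point for the next request *)
}

let make input = { input; pos = 0; mode = CODE }

let is_arg_char c =
  match c with
  | 'A'..'Z' | 'a'..'z' | '0'..'9' | '_' | '/' -> true
  | _ -> false

(* True when the two characters starting at index i are "//". *)
let at_comment lx i =
  i + 1 < String.length lx.input
  && lx.input.[i] = '/' && lx.input.[i + 1] = '/'

(* Read up to (not including) the end of line, then go back to CODE mode. *)
let read_comment lx =
  let len = String.length lx.input in
  let start = lx.pos in
  while lx.pos < len && lx.input.[lx.pos] <> '\n' do
    lx.pos <- lx.pos + 1
  done;
  lx.mode <- CODE;
  COMMENT (String.sub lx.input start (lx.pos - start))

(* Return the next token, or None at end of input. *)
let rec next_token lx =
  let len = String.length lx.input in
  match lx.mode with
  | LINE_COMMENT -> Some (read_comment lx)
  | CODE ->
    if lx.pos >= len then None
    else if at_comment lx lx.pos then begin
      lx.pos <- lx.pos + 2;
      Some (read_comment lx)
    end else if is_arg_char lx.input.[lx.pos] then begin
      let start = lx.pos in
      while lx.pos < len
            && not (at_comment lx lx.pos)
            && is_arg_char lx.input.[lx.pos] do
        lx.pos <- lx.pos + 1
      done;
      let word = String.sub lx.input start (lx.pos - start) in
      if at_comment lx lx.pos then begin
        (* Consume "//" now, but only remember it: the comment body is
           not scanned until the caller asks for the next token. *)
        lx.pos <- lx.pos + 2;
        lx.mode <- LINE_COMMENT
      end;
      Some (UNQUOTED word)
    end else begin
      lx.pos <- lx.pos + 1;  (* skip whitespace and any other character *)
      next_token lx
    end

(* Collect every token of a string, in order. *)
let tokens_of_string s =
  let lx = make s in
  let rec loop acc =
    match next_token lx with
    | None -> List.rev acc
    | Some t -> loop (t :: acc)
  in
  loop []
```

On an input such as `foo//bar`, the first call to `next_token` returns the word as soon as the "//" is seen and leaves `mode` set to `LINE_COMMENT`; the comment body is read only on the following call, which is the behavior the question asks for.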


Source: https://habr.com/ru/post/1207854/

