Force CL-Lex to read whole words

I am using CL-Lex to implement a lexer (as input for CL-YACC), and my language has several keywords, such as "let" and "in". The lexer does recognize these keywords, but it matches too eagerly: when it encounters a word like "init", it returns the first token as IN, when it should return a CONST token for the whole word "init".

This is a simplified version of the lexer:

    (define-string-lexer lexer
      (...)
      ("in" (return (values :in $@)))
      ("[a-z]([a-z]|[A-Z]|\_)" (return (values :const $@))))

How can I get the lexer to read the whole word, up to the next space, before it decides on a token?

+4
2 answers

This is both a correction to Kaz's answer and a vote of confidence for the OP.

In his original answer, Kaz stated the precedence order of Unix lex exactly backwards. From the lex documentation:

Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:

  • The longest match is preferred.

  • Among rules that match the same number of characters, the rule given first is preferred.
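
The two rules quoted above can be sketched in a few lines. This is only an illustration in Python (not lex and not CL-LEX); the rule table `RULES` and the helper `next_token` are hypothetical names, and the CONST pattern already includes a `*` so that it can match a whole word.

```python
import re

# Rules in specification order: (token name, pattern). Names are illustrative.
RULES = [
    ("IN", re.compile(r"in")),
    ("CONST", re.compile(r"[a-z][a-zA-Z_]*")),
]


def next_token(text, pos=0):
    """Pick a token the way the lex documentation describes:
    longest match first, earliest rule on a tie."""
    best = None  # (length, name, lexeme)
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strict '>' keeps the earlier rule when the lengths are equal.
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), name, m.group())
    return None if best is None else (best[1], best[2])


print(next_token("init"))  # CONST wins: it matches 4 characters, IN only 2
print(next_token("in"))    # tie at 2 characters, so the first rule (IN) wins
```

Under these rules "init" can never come back as IN, because the CONST rule matches more characters.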

Kaz is also mistaken in criticizing the OP's solution, which uses Perl-regex word-boundary matching. As it happens, you are allowed (without excruciating guilt) to match words in any way your lexer generator supports. CL-LEX uses Perl regular expressions, which offer \b as a convenient syntax for the more cumbersome lex approximation:

    %{
    #include <stdio.h>
    %}

    WC      [A-Za-z']
    NW      [^A-Za-z']

    %start  INW NIW

    %%

    {WC}    { BEGIN INW; REJECT; }
    {NW}    { BEGIN NIW; REJECT; }

    <INW>a  { printf("'a' in word\n"); }
    <NIW>a  { printf("'a' not in word\n"); }

All else being equal, finding a way to match his words unambiguously is probably better than the alternative.
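
To show why the word-boundary approach works, here is a minimal sketch using Python's `re` module, which shares Perl's `\b` and its left-to-right alternation. It assumes the lexer tries its rules in order and takes the first one that succeeds; that is an assumption about CL-LEX, not something taken from its manual. The names `NAIVE`, `BOUNDED` and `first_token` are made up for the example.

```python
import re

# Perl-style alternation tries alternatives left to right and takes the first
# that succeeds, so a bare "in" claims the start of "init".
NAIVE = re.compile(r"(?P<IN>in)|(?P<CONST>[a-z][a-zA-Z_]*)")

# \b after the keyword requires a word boundary, so "in" no longer matches
# inside a longer identifier and the CONST alternative gets its turn.
BOUNDED = re.compile(r"(?P<IN>in\b)|(?P<CONST>[a-z][a-zA-Z_]*)")


def first_token(scanner, text):
    """Return (token name, lexeme) for the match at the start of text."""
    m = scanner.match(text)
    return (m.lastgroup, m.group()) if m else None


print(first_token(NAIVE, "init"))    # IN with lexeme "in": the keyword rule fires too early
print(first_token(BOUNDED, "init"))  # CONST with lexeme "init"
print(first_token(BOUNDED, "in x"))  # IN with lexeme "in": the keyword still works on its own
```

The same `\b` suffix should carry over to the string patterns in a CL-LEX rule, with the backslash doubled inside the Lisp string.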

Despite Kaz's criticism, the OP answered his own question correctly, coming up with a solution that takes advantage of the flexibility of his lexer generator.

+8

In the lexer example above, there are two rules, both of which match a sequence of exactly two characters. Moreover, their matches overlap (the language matched by the second is a strict superset of the language matched by the first).

In classic Unix lex, if two rules match the same length of input, priority is given to the rule that occurs first in the specification. Otherwise, the longest possible match wins.

(Without having read the manual, I cannot say that this is what happens in CL-LEX, but it is a plausible hypothesis about what happens in this case.)

It looks like you are missing the Kleene star operator, which the second rule needs in order to match the longer token.
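
The effect of the missing star is easy to check in any Perl-style regex engine. The sketch below uses Python's `re` as a stand-in (it does not call CL-LEX), and the names `TWO_CHARS`, `WHOLE_WORD` and `matched_prefix` are invented for the example.

```python
import re

# The second rule as posted: one lowercase letter plus exactly one more character.
TWO_CHARS = re.compile(r"[a-z]([a-z]|[A-Z]|_)")

# The same rule with a Kleene star on the group: any number of further characters.
WHOLE_WORD = re.compile(r"[a-z]([a-z]|[A-Z]|_)*")


def matched_prefix(pattern, text):
    """Return the text the pattern matches at the start of the input, or None."""
    m = pattern.match(text)
    return m.group() if m else None


print(matched_prefix(TWO_CHARS, "init"))   # in
print(matched_prefix(WHOLE_WORD, "init"))  # init
```

Without the star, the second rule can never consume more than two characters, so it ties with "in" on the input "init" instead of beating it.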

+1

Source: https://habr.com/ru/post/1403800/

