How to create a regular expression without a specific group of letters in lex

Question


I recently started learning lex, so for practice I decided to write a program that recognizes a simple variable declaration (sort of).

This is my code:

 %{
 #include "stdio.h"
 %}

 dataType "int"|"float"|"char"|"String"
 alphaNumeric [_\*a-zA-Z][0-9]*
 space [ ]
 variable {dataType}{space}{alphaNumeric}+

 %option noyywrap
 %%
 {variable} printf("ok");
 . printf("incorect");
 %%
 int main(){
   yylex();
 }

Some cases where the output should be ok:

 int var3
 int _varR3
 int _AA3_

But if I enter int float as input, it also returns ok, which is incorrect because both are reserved words.

So my question is: what should I change so that my expression rejects the dataType words after the space?

Thanks.

+5

regex flex-lexer lex

maspinu Dec 05 '15 at 17:32


2 answers

A preliminary remark: as a rule, the structure you describe is detected at the parsing stage rather than the lexing stage. With yacc/bison, for example, you would have a rule that matches only a type token followed by an identifier token.

To achieve this with lex/flex, you might consider experimenting with the negation (^) and trailing context (/) operators. Or...

If you use flex, perhaps just enclosing all your regular expressions in parentheses and passing the -l flag will do the trick. Note that there are several differences between lex and flex, as described in the Flex manual.

+2

Leandro TC Melo Dec 6 '15 at 13:40


rici · Accepted Answer · 2015-12-05T19:22:25+0000

This is really not the way to solve this particular problem.

The usual way to do this is to write separate pattern rules for recognizing keywords and variable names. (Plus a pattern rule to ignore whitespace.) This means that the tokenizer will return two tokens for the input int var3. Recognizing that the two tokens form a valid declaration is the responsibility of the parser, which repeatedly calls the tokenizer to analyze the token stream.

However, if you really want to recognize two words as one token, this is certainly possible. (F)lex does not allow negative lookahead in regular expressions, but you can use the pattern-matching precedence rule to capture erroneous tokens.
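The division of labor described above can be sketched in plain C. This is a hand-written stand-in for what lex and a parser generator would produce, not generated code: the names TokenKind, next_token and is_declaration are hypothetical, and the keyword list is the one from the question.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds a tokenizer might return (hypothetical names). */
typedef enum { TOK_EOF, TOK_TYPE, TOK_IDENT, TOK_ERROR } TokenKind;

/* Keyword list taken from the question's dataType definition. */
static const char *const TYPE_NAMES[] = { "int", "float", "char", "String" };

/* Return 1 if the len-character word at s is one of the type keywords. */
static int is_type_name(const char *s, size_t len) {
    for (size_t i = 0; i < sizeof TYPE_NAMES / sizeof TYPE_NAMES[0]; i++)
        if (strlen(TYPE_NAMES[i]) == len && strncmp(s, TYPE_NAMES[i], len) == 0)
            return 1;
    return 0;
}

/* Read one token starting at *p and advance *p past it.
   Whitespace is skipped, like a lex rule with an empty action. */
TokenKind next_token(const char **p) {
    const char *s = *p;
    while (isspace((unsigned char)*s)) s++;
    if (*s == '\0') { *p = s; return TOK_EOF; }
    if (!isalpha((unsigned char)*s) && *s != '_') { *p = s + 1; return TOK_ERROR; }
    const char *start = s;
    while (isalnum((unsigned char)*s) || *s == '_') s++;
    *p = s;
    /* A whole word is a keyword only if it matches one exactly. */
    return is_type_name(start, (size_t)(s - start)) ? TOK_TYPE : TOK_IDENT;
}

/* The "parser": a declaration is exactly TYPE IDENT, then end of input. */
int is_declaration(const char *input) {
    const char *p = input;
    return next_token(&p) == TOK_TYPE
        && next_token(&p) == TOK_IDENT
        && next_token(&p) == TOK_EOF;
}
```

Because the tokenizer classifies each word on its own, int float yields two TOK_TYPE tokens and the parser step rejects it, while int floaty is accepted since floaty is an ordinary identifier.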

For example, you can do something like this:

 dataType int|float|char|String
 id       [[:alpha:]_][[:alnum:]_]*
 %%
 {dataType}[[:space:]]+{dataType} { puts("Error: two types"); }
 {dataType}[[:space:]]+{id}       { puts("Valid declaration"); }
   /* ... more rules ... */

The above uses Posix character classes instead of writing out the possible characters. See man isalpha for a list of Posix character classes; the character class component [:xxxxx:] contains exactly the characters accepted by the standard library function isxxxxx. I changed the pattern so that it allows more than one whitespace character between dataType and id, and simplified the pattern for ids.
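To see why the order of the two rules matters, here is a rough C sketch of how (f)lex chooses between them: the longest match wins, and on a tie the rule listed first wins. The helper names (match_type, rule_two_types, winning_rule and so on) are made up for illustration, and the sketch only looks at the start of the string, whereas a real scanner works through the whole input.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of a type keyword at the start of s, or 0.
   The keywords are not prefixes of one another, so at most one matches. */
static size_t match_type(const char *s) {
    static const char *const names[] = { "int", "float", "char", "String" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        size_t n = strlen(names[i]);
        if (strncmp(s, names[i], n) == 0) return n;
    }
    return 0;
}

/* Length of a run of whitespace, using isspace as [:space:] does. */
static size_t match_space(const char *s) {
    size_t n = 0;
    while (isspace((unsigned char)s[n])) n++;
    return n;
}

/* Length of an identifier like [[:alpha:]_][[:alnum:]_]*, or 0. */
static size_t match_id(const char *s) {
    if (!isalpha((unsigned char)s[0]) && s[0] != '_') return 0;
    size_t n = 1;
    while (isalnum((unsigned char)s[n]) || s[n] == '_') n++;
    return n;
}

/* Rule 0: type, whitespace, type. Returns the matched length or 0. */
static size_t rule_two_types(const char *s) {
    size_t t = match_type(s);
    if (t == 0) return 0;
    size_t w = match_space(s + t);
    if (w == 0) return 0;
    size_t t2 = match_type(s + t + w);
    return t2 ? t + w + t2 : 0;
}

/* Rule 1: type, whitespace, identifier. Returns the matched length or 0. */
static size_t rule_declaration(const char *s) {
    size_t t = match_type(s);
    if (t == 0) return 0;
    size_t w = match_space(s + t);
    if (w == 0) return 0;
    size_t id = match_id(s + t + w);
    return id ? t + w + id : 0;
}

/* Longest match wins; on a tie the earlier rule (rule 0) wins.
   Returns 0 or 1 for the winning rule, -1 if neither matches. */
int winning_rule(const char *s) {
    size_t len0 = rule_two_types(s);
    size_t len1 = rule_declaration(s);
    if (len0 == 0 && len1 == 0) return -1;
    return (len1 > len0) ? 1 : 0;
}
```

For int float both rules match nine characters, so the error rule wins because it comes first. For int floaty the declaration rule matches one character more and takes over, which is why the error rule has to be listed before the declaration rule and not after it.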
