How to scan identifiers correctly with Ragel

I am trying to write a scanner for my C/C++/C#/Java/D-like programming language, which I am developing for personal reasons. For this task, I am using Ragel to generate the scanner. I have trouble understanding when many of the operators invoke their actions, probably because my studies focused on practical knowledge rather than theory, and much of this non-deterministic/deterministic finite state machine business goes right over my head. I believe that either the documentation is lacking or my understanding of it is. I suspect the latter.

In any case, I am working through the basics. In my first iteration, I defined several keywords and special characters. Now I am faced with a problem where all keywords are scanned as identifiers. I use the scanner operator for all my keywords, as this resolved my issue with the string "returns" being scanned as the "return" keyword rather than the "returns" keyword.

How do I scan identifiers correctly? I understand that, to make the scanner deterministic, I effectively need to specify that a token can only be an identifier if it does not match any other token pattern. Forgive my lack of knowledge.

Ragel Script:

    %%{
        Identifier = (alpha | '_') . (alnum | '_')*;

        action IdentifierAction {
            std::cout << "identifier(\"";
            std::cout.write(ts, te - ts);
            std::cout << "\")";
        }
    }%%

    %%{
        main := |*
            Interface => InterfaceAction;
            Class => ClassAction;
            Property => PropertyAction;
            Function => FunctionAction;
            TypeQualifier => TypeQualifierAction;
            OpenParenthesis => OpenParenthesisAction;
            CloseParenthesis => CloseParenthesisAction;
            OpenBracket => OpenBracketAction;
            CloseBracket => CloseBracketAction;
            OpenBrace => OpenBraceAction;
            CloseBrace => CloseBraceAction;
            Semicolon => SemicolonAction;
            Returns => ReturnsAction;
            Return => ReturnAction;
            Identifier => IdentifierAction;
            space+;
        *|;
    }%%
1 answer

I am not familiar with Ragel, but I have written some custom parsers and scanners.

Your question seems to be more about detecting keywords than about detecting identifiers in general.

You have rules telling Ragel how to detect when a piece of text is a number, the keyword "return", a semicolon, the keyword "returns", an identifier, and so on. Although you can make a rule for each keyword, I would not recommend it.

What I have learned from experience is that it is better to read all textual keywords as identifiers (assigning them a generic identifier token) and then to work out which identifiers are “keywords” in some part of your C/C++ code.

In other words, Ragel will only detect identifiers: "myvar", "return" and "returns" will all be marked as "identifiers". Later, in your semantic action code (C/C++, not Ragel), you check each identifier and determine whether it is a keyword of your language. This is usually done with a list of keywords.

I think it will be something like this:

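    %%{
        Identifier = (alpha | '_') . (alnum | '_')*;

        action IdentifierAction {
            // Keyword list checked in host (C++) code, not by Ragel rules.
            // The host file needs <algorithm>, <iterator>, <string> and <iostream>.
            static const std::string Keywords[] = { "return", "if", "else" };

            // The matched text spans [ts, te).
            const std::string MyIdentifier(ts, te - ts);

            if (std::find(std::begin(Keywords), std::end(Keywords), MyIdentifier) != std::end(Keywords)) {
                std::cout << "keyword(\"" << MyIdentifier << "\")";
            } else {
                std::cout << "identifier(\"" << MyIdentifier << "\")";
            }
        }
    }%%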

So, there is no "Return" or "Returns" rule, just an "Identifier" rule.
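
As a rough standalone illustration of that idea (plain C++, not Ragel output), the sketch below reads each maximal identifier first and only then checks it against a keyword list, so "returns" is never split into "return" plus "s". The names scanWords and isKeyword and the keyword list itself are made up for this example, and anything that is not part of an identifier is simply skipped.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical keyword list for this sketch; a real language would have its own.
static bool isKeyword(const std::string& word) {
    static const std::set<std::string> keywords = {"return", "returns", "if", "else"};
    return keywords.count(word) != 0;
}

// Scan maximal identifiers out of `input` and label each one as a keyword or
// an identifier. Simplification: every other character is skipped rather than
// turned into a token of its own.
static std::vector<std::string> scanWords(const std::string& input) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isalpha(c) || c == '_') {
            // Consume the longest run of identifier characters.
            const std::size_t start = i;
            while (i < input.size() &&
                   (std::isalnum(static_cast<unsigned char>(input[i])) || input[i] == '_')) {
                ++i;
            }
            const std::string word = input.substr(start, i - start);
            // Classification happens after the whole word has been read.
            tokens.push_back((isKeyword(word) ? "keyword(\"" : "identifier(\"") + word + "\")");
        } else {
            ++i;
        }
    }
    return tokens;
}
```

Under these assumptions, scanWords("return returns myvar;") yields three entries: keyword("return"), keyword("returns") and identifier("myvar").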


Source: https://habr.com/ru/post/1342522/

