Regex to match a C++ string constant

I'm currently working on a C++ preprocessor, and I need to match string constants with more than 0 characters, like "hey I'm a string". I'm currently working with this one: \"([^\\\"]+|\\.)+\" but it fails on one of my test cases.

Test cases:

    std::cout << "hello" << " world";
    std::cout << "He said: \"bananas\"" << "...";
    std::cout << "";
    std::cout << "\x12\23\x34";

Expected Result:

    std::cout << String("hello") << String(" world");
    std::cout << String("He said: \"bananas\"") << String("...");
    std::cout << "";
    std::cout << String("\x12\23\x34");

For the second one I get this instead:

 std::cout << String("He said: \")bananas\"String(" << ")..."; 

Short repro code (using std::regex):

    std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
    std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
    in_line = std::regex_replace(in_line, r, "String($&)");
+5
3 answers

Lexing a source file is a good job for regular expressions. But for such a task, let's use a better regex engine than std::regex. Let's start with PCRE (or boost::regex). At the end of this post I will show what you can do with a less feature-rich engine.

We only need to perform partial lexing, ignoring all unrecognized tokens that will not affect string literals. We need to process:

  • Single-line comments
  • Multi-line comments
  • Character Literals
  • String literals

We will use the extended (x) option, which ignores whitespace in the pattern.

Comments

Here's what [lex.comment] says:

The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. - end note]

    # singleline comment
    // .* (*SKIP)(*FAIL)

    # multiline comment
    | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

Easy peasy. If you match anything there, just (*SKIP)(*FAIL), meaning that you throw the match away. The (?s: .*? ) applies the s (singleline) modifier to the . metacharacter, which means it is allowed to match newlines.

Character Literals

Here is the grammar from [lex.ccon] :

    character-literal:
        encoding-prefix(opt) ' c-char-sequence '
    encoding-prefix:
        one of u8 u U L
    c-char-sequence:
        c-char
        c-char-sequence c-char
    c-char:
        any member of the source character set except the single-quote ', backslash \, or new-line character
        escape-sequence
        universal-character-name
    escape-sequence:
        simple-escape-sequence
        octal-escape-sequence
        hexadecimal-escape-sequence
    simple-escape-sequence: one of
        \' \" \? \\ \a \b \f \n \r \t \v
    octal-escape-sequence:
        \ octal-digit
        \ octal-digit octal-digit
        \ octal-digit octal-digit octal-digit
    hexadecimal-escape-sequence:
        \x hexadecimal-digit
        hexadecimal-escape-sequence hexadecimal-digit

First, we define a few things that we will need later:

    (?(DEFINE)
      (?<prefix> (?:u8?|U|L)? )
      (?<escape> \\ (?:
        ['"?\\abfnrtv]          # simple escape
        | [0-7]{1,3}            # octal escape
        | x [0-9a-fA-F]{1,2}    # hex escape
        | u [0-9a-fA-F]{4}      # universal character name
        | U [0-9a-fA-F]{8}      # universal character name
      ))
    )

  • prefix is defined as an optional u8, u, U or L
  • escape is defined as per the standard, except that I have merged universal-character-name into it for simplicity

Once we have these, the character literal is pretty simple:

 (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL) 

We throw it away with (*SKIP)(*FAIL).

Simple strings

They are defined in much the same way as character literals. Here's the relevant part of [lex.string]:

    string-literal:
        encoding-prefix(opt) " s-char-sequence(opt) "
        encoding-prefix(opt) R raw-string
    s-char-sequence:
        s-char
        s-char-sequence s-char
    s-char:
        any member of the source character set except the double-quote ", backslash \, or new-line character
        escape-sequence
        universal-character-name

This mirrors the character literal:

 (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* " 

The differences are as follows:

  • The character sequence is optional this time ( * instead of + )
  • The double quote is disallowed instead of the single quote
  • We don’t actually throw it away :)

Raw strings

Here's the raw string:

    raw-string:
        " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
    r-char-sequence:
        r-char
        r-char-sequence r-char
    r-char:
        any member of the source character set, except a right parenthesis ) followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
    d-char-sequence:
        d-char
        d-char-sequence d-char
    d-char:
        any member of the basic source character set except: space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.

The regular expression for this is:

 (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> " 
  • [^ ()\\\t\x0B\r\n]* is the set of characters allowed in delimiters ( d-char )
  • \k<delimiter> refers to a previously matched delimiter
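
The delimiter backreference can also be tried out without PCRE. Below is a minimal sketch using std::regex, which is an assumption on my part rather than what this answer uses: std::regex has no x mode and no named groups, so the pattern is written compactly and \k<delimiter> becomes \1. The names RawLiteral and find_raw_literal are made up for this example, and the delimiter class uses \v and \f for vertical tab and form feed.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct RawLiteral {
    bool found = false;
    std::string delimiter;  // the d-char-sequence, possibly empty
    std::string body;       // everything between the outer parentheses
};

// Finds the first raw string literal in src (comments are not considered here).
RawLiteral find_raw_literal(const std::string& src) {
    // Group 1 captures the delimiter; \1 demands the same text again
    // right before the closing quote. Group 2 is the lazily matched body.
    static const std::regex re(
        R"re((?:u8?|U|L)?R"([^ ()\\\t\v\f\r\n]*)\(([\s\S]*?)\)\1")re");

    RawLiteral result;
    std::smatch m;
    if (std::regex_search(src, m, re)) {
        result.found = true;
        result.delimiter = m[1].str();
        result.body = m[2].str();
    }
    return result;
}
```

Because the body is matched lazily, a `)"` inside the literal does not end it unless the delimiter sits between the parenthesis and the quote.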

Full pattern

The full pattern is:

    (?(DEFINE)
      (?<prefix> (?:u8?|U|L)? )
      (?<escape> \\ (?:
        ['"?\\abfnrtv]          # simple escape
        | [0-7]{1,3}            # octal escape
        | x [0-9a-fA-F]{1,2}    # hex escape
        | u [0-9a-fA-F]{4}      # universal character name
        | U [0-9a-fA-F]{8}      # universal character name
      ))
    )

    # singleline comment
    // .* (*SKIP)(*FAIL)

    # multiline comment
    | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

    # character literal
    | (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

    # standard string
    | (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "

    # raw string
    | (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "

See the demo here .

boost::regex

Here's a simple demo program using boost::regex :

    #include <exception>
    #include <string>
    #include <iostream>
    #include <boost/regex.hpp>

    static void test()
    {
        boost::regex re(R"regex(
            (?(DEFINE)
              (?<prefix> (?:u8?|U|L) )
              (?<escape> \\ (?:
                ['"?\\abfnrtv]          # simple escape
                | [0-7]{1,3}            # octal escape
                | x [0-9a-fA-F]{1,2}    # hex escape
                | u [0-9a-fA-F]{4}      # universal character name
                | U [0-9a-fA-F]{8}      # universal character name
              ))
            )

            # singleline comment
            // .* (*SKIP)(*FAIL)

            # multiline comment
            | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

            # character literal
            | (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

            # standard string
            | (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "

            # raw string
            | (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
        )regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);

        std::string subject(R"subject(
    std::cout << L"hello" << " world";
    std::cout << "He said: \"bananas\"" << "...";
    std::cout << "";
    std::cout << "\x12\23\x34";
    std::cout << u8R"hello(this"is\a\""""single\\(valid)"
    raw string literal)hello";

    "" // empty string
    '"' // character literal

    // this is "a string literal" in a comment
    /* this is
       "also inside"
       //a comment */

    // and this /*
    "is not in a comment"
    // */

    "this is a /* string */ with nested // comments"
    )subject");

        std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
    }

    int main(int argc, char **argv)
    {
        try
        {
            test();
        }
        catch (const std::exception& ex)  // catch by reference to avoid slicing
        {
            std::cerr << ex.what() << std::endl;
        }

        return 0;
    }

(I disabled syntax highlighting because it goes nuts on this code.)

For some reason, I had to take the ? quantifier out of prefix (change (?<prefix> (?:u8?|U|L)? ) to (?<prefix> (?:u8?|U|L) ) and (?&prefix) to (?&prefix)? ) to make the pattern work. I believe this is a bug in boost::regex, since both PCRE and Perl work fine with the original pattern.

What if we do not have a suitable regex engine?

Note that although this pattern technically uses recursion, it never nests recursive calls. Recursion could be avoided by inlining the relevant reusable parts into the main pattern.
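
In C++ the reusable parts can be inlined without retyping them by assembling the pattern from string constants. Here is a minimal sketch with std::regex; the names kPrefix, kEscape and is_ordinary_string_literal are hypothetical, and the fragments are compact transcriptions of the prefix and escape definitions, not something taken from a library.

```cpp
#include <regex>
#include <string>

// Reusable fragments, standing in for the (?(DEFINE) ...) block.
static const std::string kPrefix = R"re((?:u8?|U|L)?)re";
static const std::string kEscape =
    R"re(\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8}))re";

// Returns true if text is exactly one ordinary (non-raw) string literal.
bool is_ordinary_string_literal(const std::string& text) {
    // Each fragment is pasted in where (?&prefix) or (?&escape) would appear.
    static const std::regex re(
        kPrefix + "\"(?:" + kEscape + R"re(|[^"\\\r\n])*")re");
    return std::regex_match(text, re);
}
```

The pattern text still contains every fragment in full, so the engine sees no subroutine calls, but the source keeps a single definition of each part.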

A couple of other constructs can be avoided at the cost of reduced performance. We can safely replace the atomic groups (?> ... ) with normal groups (?: ... ) as long as we do not nest quantifiers, which avoids catastrophic backtracking.

We can also avoid (*SKIP)(*FAIL) if we add one line of logic to the replacement function: all the alternatives to skip are grouped in a capturing group. If the capturing group matched, just ignore the match. If not, then it's a string literal.
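
The same idea can be sketched with std::regex, which has neither (*SKIP)(*FAIL) nor a callback form of regex_replace, so the replacement is done by hand with std::sregex_iterator. This is only an illustration under those assumptions: wrap_string_literals is a hypothetical name, atomic groups are replaced by plain groups without nested quantifiers, and like the demo above it wraps empty literals too.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: wraps string literals in String(...), leaving
// comments and character literals untouched.
std::string wrap_string_literals(const std::string& src) {
    static const std::string prefix = R"re((?:u8?|U|L)?)re";
    static const std::string escape =
        R"re(\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8}))re";
    // Alternatives to skip: // comment, /* comment */, character literal.
    static const std::string skip =
        R"re(//.*|/\*[\s\S]*?\*/|)re" + prefix + "'(?:" + escape + R"re(|[^'\\\r\n])+')re";
    // Raw string; its delimiter becomes capture group 2 in the full pattern.
    static const std::string raw =
        prefix + R"re(R"([^ ()\\\t\v\f\r\n]*)\([\s\S]*?\)\2")re";
    // Ordinary string literal.
    static const std::string ordinary =
        prefix + "\"(?:" + escape + R"re(|[^"\\\r\n])*")re";
    // Group 1 collects everything that must be ignored.
    static const std::regex re("(" + skip + ")|" + raw + "|" + ordinary);

    std::string out;
    std::size_t last = 0;  // end of the previous match
    for (std::sregex_iterator it(src.begin(), src.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position());
        out.append(src, last, pos - last);      // text between matches
        if (m[1].matched) {
            out += m.str();                     // comment or char literal: keep as is
        } else {
            out += "String(" + m.str() + ")";   // string literal: wrap it
        }
        last = pos + static_cast<std::size_t>(m.length());
    }
    out.append(src, last, std::string::npos);   // trailing text
    return out;
}
```

The one line of logic is the m[1].matched check. Some std::regex implementations recurse deeply on long inputs, so this is better suited to experiments than to large source files.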

All this means that we can implement this in JavaScript, which has one of the simplest regular expression engines you can find, at the price of breaking the DRY rule and making the pattern illegible. After conversion, the regular expression turns into this monster:

 (\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2" 

And here's an interactive demo you can play with:

    function run() {
        var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;

        var input = document.getElementById("input").value;
        var output = input.replace(re, function(m, ignore) {
            return ignore ? m : "String(" + m + ")";
        });
        document.getElementById("output").innerText = output;
    }

    document.getElementById("input").addEventListener("input", run);
    run();

    <h2>Input:</h2>
    <textarea id="input" style="width: 100%; height: 50px;">
    std::cout << L"hello" << " world";
    std::cout << "He said: \"bananas\"" << "...";
    std::cout << "";
    std::cout << "\x12\23\x34";
    std::cout << u8R"hello(this"is\a\""""single\\(valid)"
    raw string literal)hello";

    "" // empty string
    '"' // character literal

    // this is "a string literal" in a comment
    /* this is
       "also inside"
       //a comment */

    // and this /*
    "is not in a comment"
    // */

    "this is a /* string */ with nested // comments"
    </textarea>
    <h2>Output:</h2>
    <pre id="output"></pre>
+4

Read the relevant sections of the C++ standard; they are called lex.ccon and lex.string.

Then convert each rule that you find there into a regular expression (if you really want to use regular expressions; it may turn out that they are not up to this task).

Then build more complex regular expressions from them. Be sure to name your regular expressions exactly like the rules from the C++ standard, so you can check them against it later.

If instead of using regular expressions you want to use an existing tool, here is one of them: http://clang.llvm.org/doxygen/Lexer_8cpp_source.html . Take a look at the LexStringLiteral function.

+2

Regular expressions can be tricky for beginners, but once you understand the basics and have practiced the divide and conquer strategy, they will be your go-to tool.

What you need to look for is a quote (") that is not preceded by a backslash (\), and then read all characters up to the next such quote.

The regular expression I came up with is (".*?[^\\]"). See the code snippet below.

    std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
    std::regex re(R"((".*?[^\\]"))");
    in_line = std::regex_replace(in_line, re, "String($1)");
    std::cout << in_line << std::endl;

Output:

 std::cout << String("He said: \"bananas\"") << String("..."); 

Regex Explanation:

 (".*?[^\\]") 

Options: Case sensitive; Numbered capture; Allow zero-length matches; Regex syntax only

  • Matches the regular expression below and captures its match into backreference number 1 (".*?[^\\]")
    • Matches the character " literally
    • Matches any single character that is NOT a line break character (line feed, carriage return) .*?
      • Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
    • Matches any single character that is NOT a backslash character [^\\]
    • Matches the character " literally

String($1)

  • Insert the character string "String" literally String
  • Insert an opening parenthesis (
  • Insert the text that was last matched by capturing group number 1 $1
  • Insert a closing parenthesis )
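
As far as I can tell, (".*?[^\\]") needs at least one character between the quotes and only checks the single character before the closing quote. It therefore appears to go wrong on a literal that ends in an escaped backslash, such as "a\\", and on an empty literal followed by another literal on the same line. A sketch of an escape-aware alternative follows. It is an assumption of mine, not part of this answer: wrap_nonempty_literals is a hypothetical name, empty literals are left alone as the question asks, and comments, character literals and raw strings are deliberately not handled.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: wraps every non-empty ordinary string literal in
// String(...). Comments, character literals and raw strings are NOT handled.
std::string wrap_nonempty_literals(const std::string& src) {
    // A literal is a quote, then any run of "backslash + one character" or
    // "plain character", then the closing quote. The two alternatives never
    // overlap, so there is no ambiguous backtracking.
    static const std::regex re(R"re("(?:\\.|[^"\\\r\n])*")re");

    std::string out;
    std::size_t last = 0;  // end of the previous match
    for (std::sregex_iterator it(src.begin(), src.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position());
        out.append(src, last, pos - last);      // text between literals
        if (m.length() == 2) {
            out += m.str();                     // "" is matched but left unwrapped
        } else {
            out += "String(" + m.str() + ")";
        }
        last = pos + static_cast<std::size_t>(m.length());
    }
    out.append(src, last, std::string::npos);   // trailing text
    return out;
}
```

Matching the empty literal and then skipping it matters: with + instead of *, the second quote of "" would be taken as the start of the next match.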
+1

Source: https://habr.com/ru/post/1263495/

