Lexing a source file is a good job for regular expressions. But for such a task, let's use a better regex engine than std::regex. Let's start with PCRE (or boost::regex). At the end of this post, I'll show what you can do with a less feature-packed engine.
We only need to do partial lexing, ignoring all unrecognized tokens that do not affect string literals. What we need to handle is:
- Single-line comments
- Multi-line comments
- Character Literals
- String literals
We will use the extended (x) option, which ignores whitespace in the pattern.
Comments
Here is what [lex.comment] says:
The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. - end note]
    # singleline comment
    // .* (*SKIP)(*FAIL)

    # multiline comment
    | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
Easy peasy. If you match anything there, just (*SKIP)(*FAIL), meaning that you throw away the match. (?s: .*? ) applies the s (singleline) modifier to the . metacharacter, which means it is allowed to match newlines.
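For readers who want to poke at the two comment alternatives in isolation, here is a minimal JavaScript sketch. It is an illustration under assumptions, not the PCRE pattern itself: JavaScript has no (*SKIP)(*FAIL), so this only finds the comments, and [\s\S] stands in for (?s: . ). The names commentRe and sample are made up for the example.

```javascript
// Comment alternatives only. JavaScript has no (*SKIP)(*FAIL), so this just
// locates comments; [\s\S] replaces (?s: . ) to let the match cross newlines.
const commentRe = /\/\/.*|\/\*[\s\S]*?\*\//g;

// Hypothetical sample input, one entry per source line.
const sample = [
  "int a; // trailing /* not a block comment",
  "/* spans",
  "   two lines */ int b;",
].join("\n");

// The /* inside the // comment has no special meaning, as [lex.comment] says.
const comments = sample.match(commentRe);
console.log(comments);
```

The first match stops at the end of its line, and the second one runs across the newline up to the first */.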
Character Literals
Here is the grammar from [lex.ccon] :
    character-literal:
        encoding-prefix(opt) ' c-char-sequence '
    encoding-prefix: one of
        u8 u U L
    c-char-sequence:
        c-char
        c-char-sequence c-char
    c-char:
        any member of the source character set except the single-quote ', backslash \, or new-line character
        escape-sequence
        universal-character-name
    escape-sequence:
        simple-escape-sequence
        octal-escape-sequence
        hexadecimal-escape-sequence
    simple-escape-sequence: one of
        \' \" \? \\ \a \b \f \n \r \t \v
    octal-escape-sequence:
        \ octal-digit
        \ octal-digit octal-digit
        \ octal-digit octal-digit octal-digit
    hexadecimal-escape-sequence:
        \x hexadecimal-digit
        hexadecimal-escape-sequence hexadecimal-digit
First, we define a few things that we will need later:
    (?(DEFINE)
      (?<prefix> (?:u8?|U|L)? )
      (?<escape> \\ (?:
        ['"?\\abfnrtv]          # simple escape
        | [0-7]{1,3}            # octal escape
        | x [0-9a-fA-F]{1,2}    # hex escape
        | u [0-9a-fA-F]{4}      # universal character name
        | U [0-9a-fA-F]{8}      # universal character name
      ))
    )
- prefix is defined as an optional u8, u, U or L
- escape is defined as per the standard, except that I have merged universal-character-name into it for simplicity
Once we have it, the character literal is pretty simple:
(?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
As with the comments, we throw the match away with (*SKIP)(*FAIL).
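Here is a small JavaScript sketch of the same character-literal rule, with prefix and escape written out inline because JavaScript has no (?(DEFINE) ...) groups. It only finds the literals (there is no (*SKIP)(*FAIL) to discard them), and the names charLiteralRe, src and found are made up for the example.

```javascript
// Character literal: optional prefix, opening quote, one or more escapes or
// plain characters, closing quote. prefix and escape are inlined by hand.
const charLiteralRe =
  /(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+'/g;

// String.raw keeps the backslashes exactly as they would appear in C++ source.
const src = String.raw`char a = 'x', b = '\n', c = '\'', d = L'\x41', e = '"';`;

const found = src.match(charLiteralRe);
console.log(found);
```

Note that the escaped quote in '\'' does not end the literal, and that '"' is matched as a character literal rather than as the start of a string.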
Simple strings
They are defined in much the same way as character literals. Here's the grammar from [lex.string]:
    string-literal:
        encoding-prefix(opt) " s-char-sequence(opt) "
        encoding-prefix(opt) R raw-string
    s-char-sequence:
        s-char
        s-char-sequence s-char
    s-char:
        any member of the source character set except the double-quote ", backslash \, or new-line character
        escape-sequence
        universal-character-name
This mirrors the character literals:
(?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
The differences are as follows:
- The character sequence is optional this time (* instead of +)
- The double quote is disallowed instead of the single quote
- We don't actually throw it away :)
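To see those differences in action, here is a hedged JavaScript sketch of the standard-string rule with prefix and escape inlined by hand. The names stringRe, src and strings are illustrative only.

```javascript
// Standard string literal: same shape as the character literal, but the body
// is optional (*) and the excluded quote is the double quote.
const stringRe =
  /(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"/g;

// String.raw keeps the backslashes exactly as they would appear in C++ source.
const src = String.raw`f(L"hello", "He said: \"bananas\"", "", u8"\x12\23\x34");`;

const strings = src.match(stringRe);
console.log(strings);
```

The empty string "" is matched thanks to the * quantifier, and the escaped quotes around bananas do not terminate the literal.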
Raw strings
Here's the raw string part of the grammar:
    raw-string:
        " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
    r-char-sequence:
        r-char
        r-char-sequence r-char
    r-char:
        any member of the source character set, except a right parenthesis ) followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
    d-char-sequence:
        d-char
        d-char-sequence d-char
    d-char:
        any member of the basic source character set except: space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.
The regular expression for this is:
(?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
- [^ ()\\\t\x0B\r\n]* is the set of characters allowed in delimiters (d-char)
- \k<delimiter> refers to the previously matched delimiter
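The delimiter backreference is the interesting part of the raw-string rule, so here is a small JavaScript sketch of it. This is an approximation under assumptions: a numbered group and \1 replace the named group and \k<delimiter>, and [\s\S] replaces (?s:.). The names rawStringRe, src and matches are made up for the example.

```javascript
// Raw string: optional prefix, R", delimiter (group 1), "(", lazy body,
// ")", the same delimiter again via \1, and the closing quote.
const rawStringRe = /(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\1"/g;

// Hypothetical input; String.raw keeps backslashes as they appear in C++ source.
const src = [
  String.raw`auto s = u8R"hello(this"is\a\""""single\\(valid)"`,
  String.raw`raw string literal)hello";`,
  String.raw`auto t = R"(plain)";`,
].join("\n");

const matches = [...src.matchAll(rawStringRe)];
for (const m of matches) {
  console.log(JSON.stringify(m[1]), "=>", m[0]);
}
```

The )" after (valid does not end the first literal, because the delimiter hello is missing between the parenthesis and the quote. The second literal shows the empty-delimiter case.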
The full pattern
Here is the full pattern:
    (?(DEFINE)
      (?<prefix> (?:u8?|U|L)? )
      (?<escape> \\ (?:
        ['"?\\abfnrtv]          # simple escape
        | [0-7]{1,3}            # octal escape
        | x [0-9a-fA-F]{1,2}    # hex escape
        | u [0-9a-fA-F]{4}      # universal character name
        | U [0-9a-fA-F]{8}      # universal character name
      ))
    )

    # singleline comment
    // .* (*SKIP)(*FAIL)

    # multiline comment
    | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

    # character literal
    | (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

    # standard string
    | (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "

    # raw string
    | (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
See the demo here .
boost::regex
Here's a simple demo program using boost::regex :
    #include <string>
    #include <iostream>
    #include <exception>
    #include <boost/regex.hpp>

    static void test()
    {
        // Same pattern as above, with the ? moved out of prefix (see the note below).
        boost::regex re(R"regex(
            (?(DEFINE)
              (?<prefix> (?:u8?|U|L) )
              (?<escape> \\ (?:
                ['"?\\abfnrtv]          # simple escape
                | [0-7]{1,3}            # octal escape
                | x [0-9a-fA-F]{1,2}    # hex escape
                | u [0-9a-fA-F]{4}      # universal character name
                | U [0-9a-fA-F]{8}      # universal character name
              ))
            )

            # singleline comment
            // .* (*SKIP)(*FAIL)

            # multiline comment
            | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

            # character literal
            | (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

            # standard string
            | (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "

            # raw string
            | (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
        )regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);

        std::string subject(R"subject(
    std::cout << L"hello" << " world";
    std::cout << "He said: \"bananas\"" << "...";
    std::cout << "";
    std::cout << "\x12\23\x34";
    std::cout << u8R"hello(this"is\a\""""single\\(valid)"
    raw string literal)hello";

    "" // empty string
    '"' // character literal

    // this is "a string literal" in a comment
    /* this is
       "also inside"
       //a comment */

    // and this /*
    "is not in a comment"
    // */

    "this is a /* string */ with nested // comments"
    )subject");

        // Wrap every string literal that survives the skip rules in String(...).
        std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
    }

    int main(int argc, char **argv)
    {
        try
        {
            test();
        }
        catch (const std::exception& ex)  // catch by reference to avoid slicing
        {
            std::cerr << ex.what() << std::endl;
        }

        return 0;
    }
(I disabled syntax highlighting because it goes nuts on this code)
For some reason, I had to take the ? quantifier out of prefix (change (?<prefix> (?:u8?|U|L)? ) to (?<prefix> (?:u8?|U|L) ) and (?&prefix) to (?&prefix)?) to make the pattern work. I believe this is a bug in boost::regex, since both PCRE and Perl work fine with the original pattern.
What if we do not have a suitable regex engine?
Note that although this pattern technically uses recursion, it never nests recursive calls. Recursion could be avoided by inlining the relevant reusable parts into the main pattern.
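One way to inline the reusable parts without copying them by hand is to assemble the pattern from strings. The following JavaScript sketch is only an illustration of that idea; the names prefix, escapeSeq, skipped, literals and assembled are made up, and the skipped alternatives are put in a capturing group so the replacement callback can tell them apart from real string literals.

```javascript
// Reusable parts, written once (they play the role of (?&prefix) and (?&escape)).
const prefix = String.raw`(?:u8?|U|L)?`;
const escapeSeq = String.raw`\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})`;

// Alternatives whose matches are left alone: comments and character literals.
const skipped = [
  String.raw`//.*`,
  String.raw`/\*[\s\S]*?\*/`,
  String.raw`${prefix}'(?:${escapeSeq}|[^'\\\r\n])+'`,
].join("|");

// Alternatives that are real string literals: standard and raw strings.
const literals = [
  String.raw`${prefix}"(?:${escapeSeq}|[^"\\\r\n])*"`,
  String.raw`${prefix}R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"`,
].join("|");

// Group 1 wraps the skipped alternatives; group 2 is the raw-string delimiter.
const assembled = new RegExp(`(${skipped})|${literals}`, "g");

const marked = 'puts(L"hi"); c = \'"\'; // "no"'.replace(
  assembled,
  (m, skip) => (skip ? m : `String(${m})`)
);
console.log(marked);
```

Only L"hi" gets wrapped here: the character literal '"' and the comment both land in group 1 and are returned unchanged.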
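A couple of other constructs can be avoided at the price of reduced performance. We can safely replace the atomic groups (?> ... ) with normal groups (?: ... ) as long as we do not nest quantifiers, in order to avoid catastrophic backtracking. In practice this means dropping the inner + from the character classes, so that only the group itself is repeated.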
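To make the quantifier point concrete, here is a small JavaScript sketch with a deliberately simplified character-literal body (\\. stands in for the full escape list; the names safeCharLiteral and unterminated are made up). Only the group is quantified. If the character class inside it also carried a +, an unterminated literal like the one below could make a backtracking engine without atomic groups try an exponential number of ways to split the run of characters.

```javascript
// Simplified character-literal body: `\\.` stands in for the full escape list.
// Only the group is quantified; the character class inside it is not.
const safeCharLiteral = /'(?:\\.|[^'\\\r\n])+'/;

// An unterminated literal: the closing quote never comes, so the match must fail.
const unterminated = "'" + "a".repeat(10000);

const result = safeCharLiteral.exec(unterminated);
console.log(result); // null
```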
We can also avoid (*SKIP)(*FAIL) if we add one line of logic to the replacement function: all the alternatives to skip are grouped in a capturing group. If that group matched, just ignore the match. If not, then the match is a string literal.
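Here is a hedged JavaScript sketch of that logic, using a deliberately reduced pattern (comments and plain strings only, with \\. standing in for the full escape list) so the mechanism stays visible. The function name stringLiterals and the sample input are made up for the example.

```javascript
// Group 1 holds everything we want to skip (here: only the two comment forms).
// The second alternative is a simplified standard string literal.
const re = /(\/\/.*|\/\*[\s\S]*?\*\/)|"(?:\\.|[^"\\\r\n])*"/g;

// Collect the string literals, ignoring every match where group 1 took part.
function stringLiterals(code) {
  const result = [];
  for (const m of code.matchAll(re)) {
    if (m[1] === undefined) result.push(m[0]);
  }
  return result;
}

const code = [
  'puts("kept"); // "ignored"',
  '/* "also ignored" */ puts("kept \\"too\\"");',
].join("\n");

console.log(stringLiterals(code));
```

The comments still have to be matched, otherwise the regex would find the quotes inside them. They are simply dropped afterwards by checking the group.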
All this means that we can implement this in JavaScript, which has one of the simplest regex engines you can find, at the price of breaking the DRY rule and making the pattern illegible. The regex turns into this monstrosity once converted:
(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"
And here's an interactive demo you can play with:
    function run() {
        var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;

        var input = document.getElementById("input").value;
        var output = input.replace(re, function(m, ignore) {
            return ignore ? m : "String(" + m + ")";
        });
        document.getElementById("output").innerText = output;
    }

    document.getElementById("input").addEventListener("input", run);
    run();
    <h2>Input:</h2>
    <textarea id="input" style="width: 100%; height: 50px;">
    std::cout << L"hello" << " world";
    std::cout << "He said: \"bananas\"" << "...";
    std::cout << "";
    std::cout << "\x12\23\x34";
    std::cout << u8R"hello(this"is\a\""""single\\(valid)"
    raw string literal)hello";

    "" // empty string
    '"' // character literal

    // this is "a string literal" in a comment
    /* this is
       "also inside"
       //a comment */

    // and this /*
    "is not in a comment"
    // */

    "this is a /* string */ with nested // comments"
    </textarea>
    <h2>Output:</h2>
    <pre id="output"></pre>