To remove a line that contains only whitespace or nothing at all, you can use this regex:
(?m)^[ \t]*[\r\n]+
Your regex ^[\s|\t]*$\n would work if you set multiline mode ((?m)), but it is still incorrect. For one thing, | matches a literal |; there is no need to specify "or" in a character class. For another, \s matches any whitespace character, including TAB (\t), carriage return (\r) and linefeed (\n), which makes the class needlessly redundant and inefficient. For example, at the first blank line (after the end of the first Sub), ^[\s|\t]* will first try to match everything up to the word Public, then back off to the end of the previous line, where $\n can match.
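To see both problems concretely, here is a small sketch using Python's re module rather than .NET (an assumption on my part; for these particular patterns the two flavors should behave alike). The sample text in source is made up for illustration.

```python
import re

# Hypothetical VB-like sample with Windows (\r\n) line endings.
source = "Sub A()\r\nEnd Sub\r\n \t\r\n\r\nPublic Sub B()\r\nEnd Sub\r\n"

# The blank-line regex from above: horizontal whitespace, then line separators.
blank_lines = re.compile(r"(?m)^[ \t]*[\r\n]+")
cleaned = blank_lines.sub("", source)

# Inside a character class, | is just a literal pipe character.
pipe_matches = re.fullmatch(r"[\s|\t]", "|") is not None

# \s also covers \r and \n, so the class can run across several lines
# before the regex engine has to backtrack.
overrun = re.match(r"[\s|\t]*", " \t\r\n\r\nPublic").group()

print(repr(cleaned))
print(pipe_matches, repr(overrun))
```

The last line shows the class swallowing the whole run of whitespace and line separators in one go, which is the extra work described above.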
But a blank line, in addition to being empty or containing only horizontal whitespace (spaces or TABs), may also contain a comment. I prefer to treat these comment-only lines as blank lines, because it is relatively easy to do, and it simplifies the task of matching comments in non-blank lines, which is much harder. Here is my regex:
^[ \t]*(?:(?:REM|')[^\r\n]*)?[\r\n]+
After consuming any leading horizontal whitespace, if I see a REM or ' signifying a comment, I consume that and everything after it until the next line separator. Note that the only thing required to be present is the line separator itself. Also note the absence of the end anchor, $. It is never necessary when you explicitly match the line separators, and in this case it would break the regex. In multiline mode, $ matches only before a linefeed (\n), not before a carriage return (\r). (This behavior of the .NET flavor is incorrect and rather surprising, given Microsoft's longstanding preference for \r\n as a line separator.)
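As a quick check of the two points above, here is a sketch in Python's re module (not .NET; I am assuming the behavior carries over, and Python's $ in multiline mode likewise matches only before \n). The sample strings are invented for illustration.

```python
import re

# Hypothetical sample: one comment-only line, one REM line, one blank line,
# and one code line with a trailing comment.
source = (
    "Dim x\r\n"
    "   ' only a comment\r\n"
    "REM another one\r\n"
    "\r\n"
    "Dim y ' trailing\r\n"
)

# Blank or comment-only lines, including their line separators.
blank_or_comment = re.compile(r"(?m)^[ \t]*(?:(?:REM|')[^\r\n]*)?[\r\n]+")
cleaned = blank_or_comment.sub("", source)

# With \r\n endings, $ cannot match before the \r, so a "^[ \t]*$" style
# pattern finds no blank line; with bare \n endings it does.
with_crlf = re.findall(r"(?m)^[ \t]*$", "a\r\n \t\r\nb")
with_lf = re.findall(r"(?m)^[ \t]*$", "a\n \t\nb")

print(repr(cleaned))
print(with_crlf, with_lf)
```

Note that the trailing comment after Dim y is left alone; only lines consisting entirely of whitespace or a comment are removed by this pass.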
Matching the remaining comments is a fundamentally different task. As you have discovered, simply searching for REM or ' does not work, because you might find it inside a string literal, where it does not signify the start of a comment. What you have to do is start at the beginning of the line, consuming and capturing everything that is not the beginning of a comment or a string literal. If you find a double quote, go ahead and consume the string literal. If you find REM or ', stop capturing and consume the rest of the line. Then you replace the whole match with just the captured part, i.e. everything before the comment. Here's the regex:
(?mn)^(?<line>[^\r\n"R']*(("[^"]*"|(?!REM)R)[^\r\n"R']*)*)(REM|')[^\r\n]*
Or, more readably:
(?mn)                    # Multiline and ExplicitCapture modes
^                        # beginning of line
(?<line>                 # capture in group "line"
  [^\r\n"R']*            # any number of "safe" characters
  (
    (
      "[^"]*"            # a string literal
      |
      (?!REM)R           # 'R' if it's not the beginning of 'REM'
    )
    [^\r\n"R']*          # more "safe" characters
  )*
)                        # stop capturing
(?:REM|')                # a comment sigil
[^\r\n]*                 # consume the rest of the line
The replacement string will be "${line}" . Some other notes:
- Note that this regex does not end with [\r\n]+ to consume the line separator, as the "blank lines" regex does.
- It does not end with $ either, for the same reason as before. [^\r\n]* will greedily consume everything before the line separator, so no anchor is needed.
- The only thing required to be present is the REM or '; we do not bother matching any line that does not contain a comment.
- ExplicitCapture mode means that I can use (...) instead of (?:...) for all the groups I don't want to capture, while the named group (?<line>...) still works.
- Complicated as it is, this regex would be much worse if VB supported multi-line comments or if its string literals supported backslash escapes.
I do not do VB, but here is a demo in C#.
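For anyone who wants to try the two passes without .NET at hand, here is a rough sketch in Python. It is an adaptation, not the original demo: Python's re has no ExplicitCapture mode, so the unnamed groups are written as (?:...), the named group becomes (?P<line>...), and the replacement is \g<line> instead of ${line}. The function name strip_vb_comments and the sample source are made up for illustration.

```python
import re

# Pass 1: blank and comment-only lines, including their line separators.
BLANK_OR_COMMENT = re.compile(r"(?m)^[ \t]*(?:(?:REM|')[^\r\n]*)?[\r\n]+")

# Pass 2: trailing comments; everything before the comment is captured in "line".
TRAILING_COMMENT = re.compile(
    r"""(?m)^(?P<line>[^\r\n"R']*(?:(?:"[^"]*"|(?!REM)R)[^\r\n"R']*)*)"""
    r"""(?:REM|')[^\r\n]*"""
)

def strip_vb_comments(source: str) -> str:
    """Remove blank lines and comments from VB-like source (illustrative only)."""
    without_blank = BLANK_OR_COMMENT.sub("", source)
    return TRAILING_COMMENT.sub(r"\g<line>", without_blank)

# Hypothetical sample with Windows line endings.
sample = (
    "Sub A() ' entry point\r\n"
    "    REM comment-only line\r\n"
    "\r\n"
    "    s = \"it's REM inside a string\" ' trailing\r\n"
    "    Return x REM done\r\n"
    "End Sub\r\n"
)

result = strip_vb_comments(sample)
print(result)
```

The REM and ' inside the string literal survive, while the real comments are dropped. Whitespace that preceded a trailing comment stays on the line, since the regex captures everything up to the comment sigil.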