How to match a comment unless it is inside a quoted string?

So I have a line:

//Blah blah blach // sdfkjlasdf "Another //thing" 

And I use a Java regex to remove all double-slash comments:

 theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll(""); 

And it works for the most part, but the problem is that it removes all the occurrences, and I need to find a way so that it does not delete the quoted occurrence. How can I do it?

+4
5 answers

Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only the parts you are interested in, you could use a third-party tool such as ANTLR.

ANTLR lets you define only the tokens you are interested in (and, of course, the tokens that can mess up your token stream, such as multi-line comments and String and char literals). So you only need to define a lexer (another word for tokenizer) that handles those tokens correctly.

This is called a grammar. In ANTLR, such a grammar could look like this:

    lexer grammar FuzzyJavaLexer;

    options{filter=true;}

    SingleLineComment
      :  '//' ~( '\r' | '\n' )*
      ;

    MultiLineComment
      :  '/*' .* '*/'
      ;

    StringLiteral
      :  '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
      ;

    CharLiteral
      :  '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
      ;

Save this in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 (antlr-3.2.jar) and save it in the same folder as your FuzzyJavaLexer.g file.

Run the following command:

 java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g 

which will generate the source file FuzzyJavaLexer.java.

Of course, you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below:

    import org.antlr.runtime.*;

    public class FuzzyJavaLexerTest {
        public static void main(String[] args) throws Exception {
            String source =
                "class Test { \n"+
                "  String s = \" ... \\\" // no comment \"; \n"+
                "  /* \n"+
                "   * also no comment: // foo \n"+
                "   */ \n"+
                "  char quote = '\"'; \n"+
                "  // yes, a comment, finally!!! \n"+
                "  int i = 0; // another comment \n"+
                "} \n";
            System.out.println("===== source =====");
            System.out.println(source);
            System.out.println("==================");
            ANTLRStringStream in = new ANTLRStringStream(source);
            FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            for(Object obj : tokens.getTokens()) {
                Token token = (Token)obj;
                if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
                    System.out.println("Found a SingleLineComment on line "+token.getLine()+
                            ", starting at column "+token.getCharPositionInLine()+
                            ", text: "+token.getText());
                }
            }
        }
    }

Then compile your FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:

 javac -cp .:antlr-3.2.jar *.java 

and finally run the FuzzyJavaLexerTest class:

    // *nix/MacOS
    java -cp .:antlr-3.2.jar FuzzyJavaLexerTest

or

    // Windows
    java -cp .;antlr-3.2.jar FuzzyJavaLexerTest

after which you will see the following printed to the console:

    ===== source =====
    class Test {
      String s = " ... \" // no comment ";
      /*
       * also no comment: // foo
       */
      char quote = '"';
      // yes, a comment, finally!!!
      int i = 0; // another comment
    }
    ==================
    Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!
    Found a SingleLineComment on line 8, starting at column 13, text: // another comment

Pretty easy, huh :)

+4

Use a parser and decide it char by char.

Kickoff example:

    StringBuilder builder = new StringBuilder();
    boolean quoted = false; // true while inside a "..." literal

    for (String line : string.split("\\n")) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // Note: escaped quotes (\") and char literals ('"') are not handled here.
                quoted = !quoted;
            }
            if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
                break; // unquoted "//": drop the rest of the line
            } else {
                builder.append(c);
            }
        }
        builder.append("\n");
    }

    String parsed = builder.toString();
    System.out.println(parsed);
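
The kickoff loop flips its quoted flag on every double quote, so an escaped quote (\") or a quote inside a char literal ('"') can throw it off. Below is a minimal sketch of how the same char-by-char idea might be extended with a few explicit states. The class name LineCommentStripper, the State enum and the strip method are hypothetical names chosen for illustration, not part of the answer above, and Java text blocks (""") are deliberately not handled.

```java
class LineCommentStripper {

    // Scanner states; the names are illustrative assumptions.
    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    /** Removes // comments; string/char literals and block comments are copied through. */
    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            char next = i + 1 < source.length() ? source.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {          // start of a line comment: drop it
                        state = State.LINE_COMMENT;
                        i += 2;
                        continue;
                    }
                    if (c == '/' && next == '*') {          // start of a block comment: keep it
                        state = State.BLOCK_COMMENT;
                        out.append("/*");
                        i += 2;
                        continue;
                    }
                    if (c == '"') state = State.STRING;
                    else if (c == '\'') state = State.CHAR;
                    out.append(c);
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < source.length()) {
                        out.append(next);                   // copy the escaped char untouched
                        i += 2;
                        continue;
                    }
                    boolean closed = (state == State.STRING && c == '"')
                                  || (state == State.CHAR && c == '\'');
                    if (closed || c == '\n') {              // a newline ends an unterminated literal
                        state = State.CODE;
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n' || c == '\r') {           // comment ends at the line terminator
                        state = State.CODE;
                        out.append(c);
                    }
                    break;
                case BLOCK_COMMENT:
                    out.append(c);
                    if (c == '*' && next == '/') {
                        out.append('/');
                        state = State.CODE;
                        i += 2;
                        continue;
                    }
                    break;
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("char quote = '\"'; // comment\nint i = 0; // another"));
    }
}
```

Keeping the state in one enum rather than several booleans makes it harder to end up "inside a string and inside a comment" at the same time, which is the kind of mix-up the question runs into.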
+2

(This is in answer to the question @finnw asked in the comment under his answer. It is not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)

Here is my test code:

    String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
    String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
    String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";

    String test =
        "class Test { \n"+
        " String s = \" ... \\\" // no comment \"; \n"+
        " /* \n"+
        " * also no comment: // but no harm \n"+
        " */ \n"+
        " /* no comment: // much harm */ \n"+
        " char quote = '\"'; // comment \n"+
        " // another comment \n"+
        " int i = 0; // and another \n"+
        "} \n"
        .replaceAll(" +$", "");

    System.out.printf("%n%s%n", test);
    System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
    System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
    System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));

r0 is the edited regex from your answer; it only removes the final comment (// and another), because everything else is matched in group 1. Setting multiline mode ((?m)) is necessary for ^ and $ to work correctly, but it does not solve this problem, because your character classes can still match newlines.

r1 deals with the newline problem, but it still incorrectly matches // no comment in the string literal, for two reasons: you did not exclude the backslash in the character class in the first part of (?:[^\"\r\n]|\\\"), and you used only two backslashes to match the backslash in the second part.

r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or with single-line comments inside multi-line comments. Those could probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?
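
To see those remaining gaps in isolation, here is a small sketch that runs r2 (copied verbatim from the test code) against four single lines. The class name R2Demo and the helper strip are hypothetical names used only for this illustration.

```java
class R2Demo {

    // r2 exactly as defined in the test code above.
    static final String R2 =
        "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";

    /** Applies r2 and keeps only group 1, i.e. the part before the matched comment. */
    static String strip(String line) {
        return line.replaceAll(R2, "$1");
    }

    public static void main(String[] args) {
        String[] lines = {
            "int i = 0; // and another",               // plain trailing comment
            "String s = \" ... \\\" // no comment \";", // "//" inside a string literal
            "char quote = '\"'; // comment",            // quote inside a char literal
            "/* no comment: // much harm */"            // "//" inside a block comment
        };
        for (String line : lines) {
            System.out.println("[" + strip(line) + "]");
        }
    }
}
```

The first two lines come out as intended: the trailing comment is removed and the string literal is left alone. The third line is returned unchanged, so the real comment survives because the lone double quote in the char literal never finds a closing partner, and the fourth loses the tail of its block comment, which is the "much harm" case.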

+1

The following is from a grep-like program that I wrote (in Perl) a few years ago. It has an option to strip Java comments before processing the file:

    # ============================================================================
    # ============================================================================
    #
    # strip_java_comments
    # -------------------
    #
    # Strip the comments from a Java-like file. Multi-line comments are
    # replaced with the equivalent number of blank lines so that all text
    # left behind stays on the same line.
    #
    # Comments are replaced by at least one space.
    #
    # The text for an entire file is assumed to be in $_ and is returned
    # in $_
    #
    # ============================================================================
    # ============================================================================
    sub strip_java_comments
    {
        s!(  (?: \" [^\"\\]* (?: \\. [^\"\\]* )* \" )
           | (?: \' [^\'\\]* (?: \\. [^\'\\]* )* \' )
           | (?: \/\/ [^\n] *)
           | (?: \/\* .*? \*\/)
          )
         !
           my $x = $1;
           my $first = substr($x, 0, 1);
           if ($first eq '/')
           {
               "\n" x ($x =~ tr/\n//);
           }
           else
           {
               $x;
           }
         !esxg;
    }

This code does actually work properly and cannot be fooled by tricky comment/quote combinations. It will probably be fooled by Unicode escapes (\u0022 etc.), but you can easily deal with those first if you want to.

As it is Perl, not Java, the replacement code will have to change. I will have a quick crack at producing the Java equivalent. Stand by...

EDIT: I have just whipped this up. It will probably need work:

    // The trick is to search for both comments and quoted strings.
    // That way we won't notice a (partial or full) comment within a quoted string
    // or a (partial or full) quoted-string within a comment.
    // (I may not have translated the back-slashes accurately. You'll figure it out)

    Pattern p = Pattern.compile(
        "( (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )" +  // " ... "
        " | (?: ' [^'\\\\]* (?: \\\\. [^'\\\\]* )* ' )" +      // or ' ... '
        " | (?: // [^\\n] * )" +                               // or // ...
        " | (?: /\\* .*? \\* / )" +                            // or /* ... */
        ")",
        Pattern.DOTALL | Pattern.COMMENTS
    );

    Matcher m = p.matcher(entireInputFileAsAString);

    // appendReplacement(StringBuilder, ...) needs Java 9+; use StringBuffer on older versions
    StringBuilder output = new StringBuilder();

    while (m.find()) {
        if (m.group(1).startsWith("/")) {
            // This is a comment. Replace it with a space...
            m.appendReplacement(output, " ");

            // ... or replace it with an equivalent number of newlines
            // (exercise for reader)
        } else {
            // We matched a quoted string. Put it back
            m.appendReplacement(output, "$1");
        }
    }

    m.appendTail(output);
    return output.toString();
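
For anyone who wants to try the idea end to end, here is a self-contained sketch of the same four-alternative pattern, including one possible take on the "equivalent number of newlines" exercise. The class name JavaCommentStripper and the method strip are hypothetical, the pattern is rewritten without COMMENTS mode, and the choice to emit a single space for comments that contain no newline is my assumption, borrowed from the space replacement in the snippet above. Output is built by hand instead of with appendReplacement, so the code does not depend on the Java version.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JavaCommentStripper {

    // Matches a string literal, a char literal, a // comment or a /* */ comment,
    // whichever starts first when scanning left to right.
    private static final Pattern TOKEN = Pattern.compile(
          "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""   // " ... "
        + "|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'"       // or ' ... '
        + "|//[^\\n]*"                            // or // ...
        + "|/\\*.*?\\*/)",                        // or /* ... */
        Pattern.DOTALL);

    /** Removes both kinds of comments while keeping every line on its original line number. */
    static String strip(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0;                              // end of the previous match
        while (m.find()) {
            out.append(source, last, m.start());   // copy the code between matches
            String token = m.group(1);
            if (token.startsWith("/")) {
                // A comment: keep only its newlines, or one space if it has none.
                int newlines = 0;
                for (int i = 0; i < token.length(); i++) {
                    if (token.charAt(i) == '\n') newlines++;
                }
                if (newlines == 0) {
                    out.append(' ');
                } else {
                    for (int i = 0; i < newlines; i++) out.append('\n');
                }
            } else {
                out.append(token);                 // a quoted literal: put it back untouched
            }
            last = m.end();
        }
        out.append(source, last, source.length()); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String source =
            "String s = \" ... \\\" // no comment \";\n" +
            "/* block\n   comment */ char quote = '\"'; // comment\n";
        System.out.print(strip(source));
    }
}
```

Because quoted literals are consumed as whole tokens, a // or /* inside them is never seen as the start of a comment, and a quote inside a comment is never seen as the start of a literal.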
+1

You cannot tell with a regex whether you are inside a double-quoted string or not. In the end, a regex is just a state machine (sometimes extended a bit). I would use a parser like the one provided by BalusC, or this one.

If you want to know why regexes are limited, read up on formal grammars. A good place to start is the Wikipedia article.

0

Source: https://habr.com/ru/post/1301634/

