Instead of using a parser that parses the entire Java source file, or writing something yourself that parses only the parts you are interested in, you could use a third-party tool such as ANTLR.
ANTLR can be told to lex only the tokens you are interested in (and, of course, the tokens that could otherwise mess up your token stream, such as multi-line comments and String and char literals). So you only need to define a lexer (another word for tokenizer) that handles those tokens correctly.
This is called a grammar. In ANTLR, such a grammar could look like this:
lexer grammar FuzzyJavaLexer;

// filter=true: silently skip any input that matches none of the rules below
options{filter=true;}

SingleLineComment
  :  '//' ~( '\r' | '\n' )*
  ;

MultiLineComment
  :  '/*' .* '*/'
  ;

StringLiteral
  :  '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
  ;

CharLiteral
  :  '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
  ;
Save the grammar in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 (antlr-3.2.jar) and save it in the same folder as your FuzzyJavaLexer.g file.
Run the following command:
java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g
which will create the source file FuzzyJavaLexer.java.
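To get a feel for what the four rules and filter=true do together, here is a rough analogue using only java.util.regex. It is not what ANTLR generates, just a sketch: the class name FuzzyCommentFinder and the method singleLineComments are made up for this illustration. Each alternative in the pattern mirrors one lexer rule, and Matcher.find() skips everything that matches none of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: a regex analogue of the fuzzy lexer.
public class FuzzyCommentFinder {

    // One alternative per lexer rule, tried in the same order as in the grammar.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<single>//[^\\r\\n]*)"                 // SingleLineComment
        + "|(?<multi>/\\*.*?\\*/)"                  // MultiLineComment (reluctant .*?)
        + "|(?<string>\"(?:\\\\.|[^\"\\\\])*\")"    // StringLiteral
        + "|(?<chr>'(?:\\\\.|[^'\\\\])*')",         // CharLiteral
        Pattern.DOTALL);                            // let '.' match newlines too

    /** Returns the text of every single-line comment found in the source. */
    public static List<String> singleLineComments(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        // find() jumps over characters that match no alternative,
        // which is roughly what filter=true does in the lexer.
        while (m.find()) {
            if (m.group("single") != null) {
                found.add(m.group("single"));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String source =
              "String s = \" ... \\\" // no comment \";\n"
            + "/* also no comment: // foo */\n"
            + "char quote = '\"';\n"
            + "int i = 0; // another comment\n";
        for (String comment : singleLineComments(source)) {
            System.out.println(comment);
        }
    }
}
```

Running main prints only `// another comment`: the `//` inside the string literal and inside the multi-line comment is consumed by the other alternatives first. The rest of this answer sticks with the ANTLR grammar.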
Of course, you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below into it:
import org.antlr.runtime.*;

public class FuzzyJavaLexerTest {
    public static void main(String[] args) throws Exception {
        String source =
            "class Test { \n"+
            " String s = \" ... \\\" // no comment \"; \n"+
            " /* \n"+
            " * also no comment: // foo \n"+
            " */ \n"+
            " char quote = '\"'; \n"+
            " // yes, a comment, finally!!! \n"+
            " int i = 0; // another comment \n"+
            "} \n";
        System.out.println("===== source =====");
        System.out.println(source);
        System.out.println("==================");
        ANTLRStringStream in = new ANTLRStringStream(source);
        FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        for(Object obj : tokens.getTokens()) {
            Token token = (Token)obj;
            if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
                System.out.println("Found a SingleLineComment on line "+token.getLine()+
                    ", starting at column "+token.getCharPositionInLine()+
                    ", text: "+token.getText());
            }
        }
    }
}
Then compile FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing (on Windows, use ; instead of : as the classpath separator):
javac -cp .:antlr-3.2.jar *.java
and finally execute the FuzzyJavaLexerTest.class file:
// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest
or
// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest
after which you will see the following on the console:
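===== source =====
class Test {
 String s = " ... \" // no comment ";
 /*
 * also no comment: // foo
 */
 char quote = '"';
 // yes, a comment, finally!!!
 int i = 0; // another comment
}

==================
followed by one "Found a SingleLineComment" line for each of the two real comments, on lines 7 and 8 of the source.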
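A note on the positions the test prints: in ANTLR 3, getLine() counts lines from 1, while getCharPositionInLine() counts characters from 0. Here is a minimal sketch of that convention in plain Java; LineColumn and locate are hypothetical names used only for this illustration and are not part of ANTLR.

```java
// Hypothetical helper, not part of ANTLR: maps a character offset to a
// (line, column) pair using the same convention as ANTLR 3 tokens.
public class LineColumn {

    /** Returns {line, column}: line is 1-based, column is 0-based. */
    public static int[] locate(String source, int offset) {
        int line = 1;
        int column = 0;
        for (int i = 0; i < offset; i++) {
            if (source.charAt(i) == '\n') {
                line++;       // a newline starts the next line...
                column = 0;   // ...and resets the column
            } else {
                column++;
            }
        }
        return new int[] { line, column };
    }

    public static void main(String[] args) {
        String source = "int a;\n  // note\n";
        int[] pos = locate(source, source.indexOf("//"));
        System.out.println("line " + pos[0] + ", column " + pos[1]);
    }
}
```

For this input, main prints `line 2, column 2`: the comment sits on the second line, after two spaces of indentation.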
Pretty easy, huh :)