In a way, both the answer by Mukesh Kumar and the one by GsusRecovery are helpful, but not fully correct.
document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
seems to replace all invalid characters, but CoreNLP still does not handle some of the remaining ones. I determined those manually by running the parser over my whole corpus, which led to the following:
document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
So now I run two replaceAll() commands before handing the document to the parser. Full code snippet:
// remove invalid unicode characters
String tmpDoc1 = document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
// remove other unicode characters coreNLP can't handle
String tmpDoc2 = tmpDoc1.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(tmpDoc2));
for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    System.err.println(gs);
}
This is not necessarily a complete list of unsupported characters, which is why I opened an issue on GitHub.
Please note that CoreNLP automatically removes these unsupported characters anyway. The only reason to pre-process the text in my case is to avoid all of those error messages.
UPDATE November 27th
Christopher Manning just answered the GitHub issue I opened. There are several ways to handle these characters using the class edu.stanford.nlp.process.TokenizerFactory. Take this sample code to tokenize a document:
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
TokenizerFactory<? extends HasWord> factory = null;
factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
tokenizer.setTokenizerFactory(factory);
for (List<HasWord> sentence : tokenizer) {
    // process each sentence as in the snippet above
}
You can replace noneDelete in line 4 with other options. Quoting Manning:
"... , allKeep. "
This means that, to keep the characters without receiving all of those error messages, the best way is to use the option noneKeep (the untokenizable option also accepts firstDelete, allDelete, firstKeep, and allKeep, besides the two mentioned here). This approach is far more elegant than any attempt to remove those characters.
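For completeness, here is a minimal, self-contained sketch (not part of Manning's reply) that combines untokenizable=noneKeep with the tagging and parsing loop from the snippet above. The model paths and the sample document string are placeholders; adjust them to the models shipped with your CoreNLP version:

import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;

public class NoneKeepExample {
    public static void main(String[] args) {
        // placeholder model paths -- adjust to your CoreNLP distribution
        MaxentTagger tagger = new MaxentTagger(
                "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
        DependencyParser parser = DependencyParser.loadFromModelFile(
                "edu/stanford/nlp/models/parser/nndep/english_UD.gz");

        // placeholder input; in practice this would be your raw document text
        String document = "Text that may contain emoji and other untokenizable characters";

        // noneKeep: keep untokenizable characters as tokens and log no warnings about them
        DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
        TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
        factory.setOptions("untokenizable=noneKeep");
        tokenizer.setTokenizerFactory(factory);

        for (List<HasWord> sentence : tokenizer) {
            List<TaggedWord> tagged = tagger.tagSentence(sentence);
            GrammaticalStructure gs = parser.predict(tagged);
            System.err.println(gs);
        }
    }
}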