How to remove invalid Unicode characters from strings in Java

I use the CoreNLP Neural Network Dependency Parser to analyze some content from social networks. Unfortunately, the file contains characters which, according to fileformat.info, are not valid Unicode characters or are Unicode replacement characters, for example U+D83D or U+FFFD. If these characters are in the file, CoreNLP responds with error messages like this:

Nov 15, 2015 5:15:38 PM edu.stanford.nlp.process.PTBLexer next WARNING: Untokenizable: ? (U+D83D, decimal: 55357) 

Based on this answer, I tried document.replaceAll("\\p{C}", ""); to just delete these characters. document here is simply the document as a String. But that did not help.

How can I remove these characters from a string before passing it to CoreNLP?

UPDATE (November 16th):

For completeness, I should mention that I asked this question only to avoid a huge number of error messages by preprocessing the file. CoreNLP simply ignores characters that it cannot process, so this is not a problem.

4 answers

In a sense, both the answers provided by Mukesh Kumar and GsusRecovery help, but neither is completely correct.

 document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", ""); 

seems to replace all invalid characters. But CoreNLP apparently cannot handle even more characters. I determined them manually by running the parser over my whole corpus, which led to the following:

 document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", ""); 

So now I run two replaceAll() commands before passing the document to the parser. The full code snippet:

 // remove invalid unicode characters
 String tmpDoc1 = document.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
 // remove other unicode characters coreNLP can't handle
 String tmpDoc2 = tmpDoc1.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010\\u3011\\u300A\\u166D\\u200C\\u202A\\u202C\\u2049\\u20E3\\u300B\\u300C\\u3030\\u065F\\u0099\\u0F3A\\u0F3B\\uF610\\uFFFC]", "");
 DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(tmpDoc2));
 for (List<HasWord> sentence : tokenizer) {
     List<TaggedWord> tagged = tagger.tagSentence(sentence);
     GrammaticalStructure gs = parser.predict(tagged);
     System.err.println(gs);
 }

This is not necessarily the complete list of unsupported characters, which is why I opened an issue on GitHub.

Please note that CoreNLP automatically skips these unsupported characters. The only reason I want to pre-process my corpus is to avoid all those error messages.

UPDATE (November 27th):

Christopher Manning just answered the GitHub issue I opened. There are several ways to handle these characters using the class edu.stanford.nlp.process.TokenizerFactory. Take this sample code to tokenize a document:

 DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
 TokenizerFactory<? extends HasWord> factory = null;
 factory = PTBTokenizer.factory();
 factory.setOptions("untokenizable=noneDelete");
 tokenizer.setTokenizerFactory(factory);
 for (List<HasWord> sentence : tokenizer) {
     // do something with the sentence
 }

You can replace noneDelete in line 4 with other options. Paraphrasing Manning, the untokenizable option accepts six values: noneDelete, firstDelete, allDelete, noneKeep, firstKeep and allKeep.

This means that the best way to keep the characters without getting all those error messages is to use the noneKeep option. This approach is far more elegant than any attempt to remove those characters.
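As a sketch, the snippet above can be adapted to keep the untokenizable characters while suppressing the per-character warnings; only the option value changes:

 // same setup as above, but with untokenizable=noneKeep: keep the characters, log no warnings
 DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(document));
 TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
 factory.setOptions("untokenizable=noneKeep");
 tokenizer.setTokenizerFactory(factory);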


Remove certain unwanted characters with:

 document.replaceAll("[\\uD83D\\uFFFD\\uFE0F\\u203C\\u3010]", ""); 

If you find other unwanted characters, simply add them to the list following the same scheme.

UPDATE

The regular expression engine divides Unicode characters into 7 macro-groups (and several sub-groups), each identified by a single letter (macro-group) or two letters (sub-group).
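As a minimal illustration (the sample strings here are made up, not taken from the answer), a one-letter class matches a whole macro-group, while a two-letter class matches only one of its sub-groups:

 // \p{L} is the "letter" macro-group, \p{Lu} its "uppercase letter" sub-group
 System.out.println("Abc1".replaceAll("\\p{L}", ""));   // prints: 1
 System.out.println("Abc1".replaceAll("\\p{Lu}", ""));  // prints: bc1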

Based on your examples and on the Unicode classes listed on the always good Regular Expressions site, I think you can try a single only-good-pass filter, for example:

 document.replaceAll("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]","") 

This regular expression removes everything that is not:

  • \p{L} : a letter in any language
  • \p{N} : a number
  • \p{Z} : any kind of whitespace or invisible separator
  • \p{Sm}\p{Sc}\p{Sk} : math, currency or modifier symbols as a single char
  • \p{Mc}* : a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages)
  • \p{Pi}\p{Pf}\p{Pc}* : opening quote, closing quote, word connectors (e.g. underscore)

* : I think these groups can also be removed for CoreNLP purposes.

This way you only need a single regex filter, and you handle whole groups of characters (that serve the same purpose) instead of single cases.
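A minimal sketch of this only-good-pass filter in action (the sample text and the expected output are assumptions for illustration, not from the answer):

 // sample input with an accented letter, a currency symbol, an emoji (So) and a control char (Cc)
 String document = "caf\u00E9 10\u20AC \uD83D\uDE00 test\u0000";
 String cleaned = document.replaceAll("[^\\p{L}\\p{N}\\p{Z}\\p{Sm}\\p{Sc}\\p{Sk}\\p{Pi}\\p{Pf}\\p{Pc}\\p{Mc}]", "");
 // letters, digits, spaces and the currency symbol survive; the emoji and the control character are dropped
 System.out.println(cleaned); // café 10€  test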


If you have a string like

 String xml = "....";
 xml = xml.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");

This will solve your problem.


Replacing everything can have negative effects in other places, so I suggest replacing a character only if it is not a BMP character, as shown below:

 private String removeNonBMPCharacters(final String input) {
     StringBuilder strBuilder = new StringBuilder();
     input.codePoints().forEach((i) -> {
         if (Character.isSupplementaryCodePoint(i)) {
             strBuilder.append("?");
         } else {
             strBuilder.append(Character.toChars(i));
         }
     });
     return strBuilder.toString();
 }
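A small usage sketch (the sample string is an assumption for illustration): every code point outside the BMP, such as an emoji, is replaced by a single "?", while BMP characters pass through unchanged.

 String raw = "good \uD83D\uDE00 text";        // contains U+1F600, a supplementary code point
 String cleaned = removeNonBMPCharacters(raw);
 System.out.println(cleaned);                  // prints: good ? text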

Source: https://habr.com/ru/post/1235997/

