Filter punctuation and symbol characters out of a UTF-8 string

What is the best and most efficient way to filter all UTF-8 punctuation and symbol characters, such as ✀ ✁ ✂ ✃ ✄ ✅ ✆ ✇ ✈, out of a string? Simply removing every character that is not in a-z, A-Z, or 0-9 is not an option, because I want to keep letters from other languages (ą, ę, etc.). Thanks in advance.

+4
4 answers

You can use \p{L} to match any Unicode letter. Example:

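    public static void main(String[] args) {
        String[] test = {"asdEWR1", "ąęóöòæûùÜ", "sd,", "✀", "✁", "✂", "✃", "✄", "✅", "✆", "✇", "✈"};
        for (String s : test) {
            // Remove everything that is neither a Unicode letter (\p{L}) nor a digit (\d).
            // Inside a character class only the leading ^ negates; a second ^ would be a literal caret.
            // By default \d matches ASCII digits only.
            System.out.println(s + " => " + s.replaceAll("[^\\p{L}\\d]", ""));
        }
    }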

outputs:

    asdEWR1 => asdEWR1
    ąęóöòæûùÜ => ąęóöòæûùÜ
    sd, => sd
    ✀ => 
    ✁ => 
    ✂ => 
    ✃ => 
    ✄ => 
    ✅ => 
    ✆ => 
    ✇ => 
    ✈ => 
+3

Try combining Unicode binary properties in a negated character class:

    String fixed = value.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]", "");
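
To see what that pattern keeps and drops, here is a minimal sketch that applies it to a few sample strings. The class name AlphanumericFilter and the method keepAlphanumeric are hypothetical names chosen for illustration, and the non-ASCII samples are written as \u escapes so that the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class AlphanumericFilter {
    // Matches any character that is neither Alphabetic nor a decimal digit.
    // (Hypothetical helper; only the pattern itself comes from the answer.)
    private static final Pattern NOT_ALNUM =
            Pattern.compile("[^\\p{IsAlphabetic}\\p{IsDigit}]");

    static String keepAlphanumeric(String value) {
        return NOT_ALNUM.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {
            "asdEWR1",
            "\u0105\u0119",              // two Polish letters with ogonek
            "sd,",
            "\u2702 cut \u2708 fly",     // U+2702 scissors, U+2708 airplane
            "\u0663 apples"              // U+0663 Arabic-Indic digit three
        };
        for (String s : samples) {
            System.out.println(s + " => " + keepAlphanumeric(s));
        }
    }
}
```

Note that spaces are removed as well, because whitespace is neither alphabetic nor a digit. If spaces should survive, one option is to add \\s inside the brackets.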
+3

The idea is to remove the accents (diacritics) first.

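    // Requires: import java.text.Normalizer;
    public static String onlyASCII(String s) {
        // Decompose accented characters, e.g. ŝ into s plus a combining circumflex.
        String s2 = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove everything that is neither ASCII nor a letter; this drops the combining marks.
        return s2.replaceAll("[^\\u0000-\\u007E\\pL]", "");
    }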

The \\pL part keeps letters from Greek and other non-Latin scripts.
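
A short, self-contained sketch of how this behaves on a few inputs may help. The wrapper class AccentStripper is a hypothetical name, and the inputs are written as \u escapes to avoid depending on the source file encoding.

```java
import java.text.Normalizer;

class AccentStripper {
    // Same approach as in the answer: decompose first, then drop every
    // character that is neither in the ASCII range nor a letter.
    static String onlyASCII(String s) {
        String s2 = Normalizer.normalize(s, Normalizer.Form.NFD);
        return s2.replaceAll("[^\\u0000-\\u007E\\pL]", "");
    }

    public static void main(String[] args) {
        System.out.println(onlyASCII("\u015D"));   // s with circumflex
        System.out.println(onlyASCII("\u03AC"));   // Greek alpha with tonos
        System.out.println(onlyASCII("\u2702"));   // scissors symbol
        System.out.println(onlyASCII("sd,"));      // ASCII punctuation
    }
}
```

The accented letters lose their marks, and the scissors symbol is removed. ASCII punctuation such as the comma survives, because the whole range up to \u007E is kept. If that is not wanted, the ASCII range in the pattern would have to be narrowed.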

+1

The term "punctuation" is rather vague. The Character class provides a getType() method that reports the Unicode general category of a character (for example, Character.UPPERCASE_LETTER or Character.DASH_PUNCTUATION), so this is probably the best place to start.

I would also recommend applying "positive" logic (for example, keep all letters and digits) rather than "negative" logic (remove punctuation), because the test is likely to be much simpler.
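
One way this could look is sketched below: a whitelist built on Character.getType() that keeps only letters and decimal digits. The class CategoryFilter and its method names are hypothetical, and the choice of categories is an assumption about what "letters and digits" should mean here.

```java
class CategoryFilter {
    // Positive test: true only for the letter categories and decimal digits.
    static boolean isLetterOrDigit(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.DECIMAL_DIGIT_NUMBER:
                return true;
            default:
                return false;
        }
    }

    // Iterate by code point so that characters outside the BMP are handled too.
    static String keepLettersAndDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(CategoryFilter::isLetterOrDigit)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepLettersAndDigits("sd,"));
        System.out.println(keepLettersAndDigits("\u2702abc1"));  // scissors + abc1
    }
}
```

Adding or removing a category is a one-line change in the switch, which is the main appeal of the whitelist approach.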

0

Source: https://habr.com/ru/post/1480536/

