Filter punctuation and UTF-8 characters from a string

Question

What is the best and most efficient way to filter all symbols and punctuation characters, such as ✀ ✁ ✂ ✃ ✄ ✅ ✆ ✇ ✈, out of a UTF-8 string? Simply removing every character that is not in a-z, A-Z or 0-9 is not an option, because I want to keep letters from other languages (ą, ę, etc.). Thanks in advance.

+4

java regex utf-8

user1315305 May 13, '13 at 16:33

4 answers

Try combining Unicode binary property classes:

 String fixed = value.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]", "");

+3

rolfl May 13, '13 at 16:38

The idea is to remove the accents first.

 import java.text.Normalizer;

 public static String onlyASCII(String s) {
     // Decompose any ŝ into s and a combining ^.
     String s2 = Normalizer.normalize(s, Normalizer.Form.NFD);
     // Remove everything that is neither ASCII (U+0000 to U+007E) nor a letter.
     return s2.replaceAll("[^\\u0000-\\u007E\\pL]", "");
 }

The \\pL class keeps letters from Greek and other non-Latin scripts.

+1

Joop eggen May 13, '13 at 16:52

The term "punctuation" is rather vague. The Character class provides a getType() method that exposes at least some of the character categories defined in the Unicode specification, so this is probably the best place to start.

I would also recommend applying "positive" logic (for example, keep all letters and digits) rather than "negative" logic (remove punctuation), because the test is likely to be much simpler.
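As a rough illustration of the "positive" approach, the sketch below keeps only the code points that Character.isLetterOrDigit accepts, a check based on the same general categories that getType() reports. The class and method names are made up for this example, and it assumes Java 8 or later for String.codePoints().

```java
public class LetterDigitFilter {

    // Keep only code points that Character classifies as letters or digits.
    // Everything else (punctuation, symbols, whitespace, combining marks) is dropped.
    static String keepLettersAndDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(Character::isLetterOrDigit)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "sdąę42"
        System.out.println(keepLettersAndDigits("sd, ąę ✂ 42"));
    }
}
```

Iterating over code points rather than chars means characters outside the Basic Multilingual Plane are classified as a whole instead of as two separate surrogates.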

0

parsifal May 13, '13 at 16:38

assylias · Accepted Answer · 2013-05-13T16:41:03+0000

You can use \p{L} to match all Unicode letters. Example:

 public static void main(String[] args) {
     String[] test = {"asdEWR1", "ąęóöòæûùÜ", "sd,", "✀", "✁", "✂", "✃", "✄", "✅", "✆", "✇", "✈"};
     for (String s : test) {
         // Remove everything that is neither a letter nor a digit.
         System.out.println(s + " => " + s.replaceAll("[^\\p{L}\\d]", ""));
     }
 }

outputs:

 asdEWR1 => asdEWR1
 ąęóöòæûùÜ => ąęóöòæûùÜ
 sd, => sd
 ✀ =>
 ✁ =>
 ✂ =>
 ✃ =>
 ✄ =>
 ✅ =>
 ✆ =>
 ✇ =>
 ✈ =>
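One detail worth checking if digits from other scripts matter: by default, \d in a Java pattern matches only the ASCII digits 0-9, whereas \p{Nd} matches any Unicode decimal digit. The sketch below compares the two; the class and method names are invented for this illustration.

```java
public class DigitClassDemo {

    // Keeps letters from any script, but only the ASCII digits 0-9.
    static String keepAsciiDigits(String s) {
        return s.replaceAll("[^\\p{L}\\d]", "");
    }

    // Keeps letters and decimal digits from any script.
    static String keepUnicodeDigits(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]", "");
    }

    public static void main(String[] args) {
        // U+0663 is ARABIC-INDIC DIGIT THREE.
        String sample = "a\u0663,b7";
        System.out.println(keepAsciiDigits(sample));   // the Arabic-Indic digit is removed
        System.out.println(keepUnicodeDigits(sample)); // the Arabic-Indic digit is kept
    }
}
```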
