Java regex to support Unicode?

To match A-Z, we use the regular expression:

[A-Za-z]

How can a regex match UTF-8 characters entered by the user? For example, Chinese words such as 环保部.

+47
java regex unicode cjk
Jun 05 2018-12-12T00:
4 answers

What you are looking for are Unicode properties.

E.g. \p{L} matches any letter from any language.

Thus, a regular expression that matches such a Chinese word could be something like

 \p{L}+ 

There are many such properties, for more details see regular-expressions.info
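As a minimal sketch of the \p{L}+ idea (the class name and the isAllLetters helper are made up for illustration, not part of any library), the pattern can be compiled once and used to check whole strings:

```java
import java.util.regex.Pattern;

class UnicodeLetterDemo {
    // \p{L}+ : one or more characters in the Unicode "Letter" general category.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: true when the entire input consists of letters only.
    static boolean isAllLetters(String input) {
        return LETTERS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllLetters("环保部"));   // Han ideographs are letters
        System.out.println(isAllLetters("Straße"));  // so are non-ASCII Latin letters
        System.out.println(isAllLetters("abc123"));  // digits are not part of \p{L}
    }
}
```

Note that matches() requires the whole input to match; use find() if a letter run anywhere in the text is enough.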

Another option is to use the flag

Pattern.UNICODE_CHARACTER_CLASS

Java 7 added the flag Pattern.UNICODE_CHARACTER_CLASS, which enables the Unicode version of the predefined character classes; see my answer here for more details and links.

You can do something like this

 Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS); 

and \w will match all letters and all digits from any language (and, of course, some word-joining characters such as _).
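A small comparison, sketched under the assumption of Java 7 or later, shows the difference the flag makes. The class and constant names are illustrative only; (?U) is the embedded form of the same flag:

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Default \w is ASCII only: [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // With the flag, \w follows the Unicode definition of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Same flag, written inline in the pattern itself.
    static final Pattern INLINE_UNICODE_WORD = Pattern.compile("(?U)\\w+");

    public static void main(String[] args) {
        String input = "环保部_42";
        System.out.println(ASCII_WORD.matcher(input).matches());
        System.out.println(UNICODE_WORD.matcher(input).matches());
        System.out.println(INLINE_UNICODE_WORD.matcher(input).matches());
    }
}
```

The flag also changes \d, \s, \b and the POSIX classes, so it is worth checking that the broader definitions are what the rest of the pattern expects.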

+73
Jun 05 2018-12-12T00:

To match individual characters, you can simply include them in the character class, either as literals, or through the \u03FB syntax.

Obviously, you often cannot list all valid characters in ideographic languages. To let the regular expression handle Unicode characters by type or code block, various other escapes are supported, which are defined here. See the “Unicode Support” section, in particular for references to the Character class and to the Unicode standard.
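The three approaches mentioned in this answer might look like the sketch below: an explicit character class, a script escape, and a block escape. The class and constant names are illustrative; \p{IsHan} assumes Java 7 or later, and the source file is assumed to be compiled as UTF-8 because of the literal characters.

```java
import java.util.regex.Pattern;

class UnicodeScriptDemo {
    // Explicit list: three literal characters plus one Unicode escape (U+03FB).
    static final Pattern LISTED = Pattern.compile("[环保部\\u03FB]+");

    // Script escape: any character whose Unicode script is Han.
    static final Pattern HAN_SCRIPT = Pattern.compile("\\p{IsHan}+");

    // Block escape: any character in the CJK Unified Ideographs block.
    static final Pattern CJK_BLOCK = Pattern.compile("\\p{InCJKUnifiedIdeographs}+");

    public static void main(String[] args) {
        String word = "环保部";
        System.out.println(LISTED.matcher(word).matches());
        System.out.println(HAN_SCRIPT.matcher(word).matches());
        System.out.println(CJK_BLOCK.matcher(word).matches());
        System.out.println(HAN_SCRIPT.matcher("ABC").matches());
    }
}
```

A script escape is usually the more robust choice of the two, since Han characters are spread over several blocks.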

+7
Jun 05 '12 at 8:50

To support NLS (national language support) while rejecting English special characters, we can use the pattern below:

[a-zA-Z0-9 \u0080-\u9fff]*+

For a reference of UTF-8 code points: http://www.utf8-chartable.de/unicode-utf8-table.pl

Code snippet:

  String vowels = "అఆఇఈఉఊఋఌఎఏఐఒఓఔౠౡ";
  String consonants = "కఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహ";
  String signsAndPunctuations = "కఁకంకఃకాకికీకుకూకృకౄకెకేకైకొకోకౌక్కౕకౖ";
  String symbolsAndNumerals = "౦౧౨౩౪౫౬౭౮౯";
  String engChinesStr = "ABC導字會";

  Pattern ALPHANUMERIC_AND_SPACE_PATTERN_TELUGU =
      Pattern.compile("[a-zA-Z0-9 \\u0c00-\\u0c7f]*+");
  System.out.println(ALPHANUMERIC_AND_SPACE_PATTERN_TELUGU.matcher(vowels).matches());

  Pattern ALPHANUMERIC_AND_SPACE_PATTERN_CHINESE =
      Pattern.compile("[a-zA-Z0-9 \\u4e00-\\u9fff]*+");

  Pattern ENGLISH_ALPHANUMERIC_SPACE_AND_NLS_PATTERN =
      Pattern.compile("[a-zA-Z0-9 \\u0080-\\u9fff]*+");
  System.out.println(ENGLISH_ALPHANUMERIC_SPACE_AND_NLS_PATTERN.matcher(engChinesStr).matches());

+6
Jul 07 '15 at 10:04
  • The Java regex API works on the char type
  • The char type is implicitly UTF-16
  • If you have UTF-8 data, you need to transcode it to UTF-16 on input, if that has not already been done.
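
The transcoding step in the last point could be sketched as follows, assuming the input arrives as raw UTF-8 bytes. The class and the lettersOnly helper are hypothetical names used only for this illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class Utf8InputDemo {
    // Hypothetical helper: decode UTF-8 bytes into a (UTF-16) String,
    // then apply the regex to the decoded text rather than to the bytes.
    static boolean lettersOnly(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return Pattern.matches("\\p{L}+", text);
    }

    public static void main(String[] args) {
        byte[] raw = "环保部".getBytes(StandardCharsets.UTF_8);
        System.out.println(raw.length + " bytes");
        System.out.println(lettersOnly(raw));
    }
}
```

Text read through a Reader or Scanner with the correct charset is already decoded, so this step only matters when working with byte arrays or streams directly.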

Unicode is a universal character set, and UTF-8 can encode all of it (including control characters, punctuation, symbols, letters, etc.). You need to be more specific about what you want to include and what you want to exclude. Java regular expressions use the \p{category} syntax to match code points by category. See the Unicode standard for a list of categories.
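
As one possible way of being specific, the sketch below allows letters, decimal digits, and space separators while rejecting everything else, including punctuation. The chosen categories and the names in the code are assumptions made for illustration, not a recommendation for every use case:

```java
import java.util.regex.Pattern;

class CategoryFilterDemo {
    // \p{L}  = letters, \p{Nd} = decimal digits, \p{Zs} = space separators.
    // Anything outside these categories (e.g. punctuation, \p{P}) is rejected.
    static final Pattern ALLOWED = Pattern.compile("[\\p{L}\\p{Nd}\\p{Zs}]+");

    // Hypothetical helper: true when every character is in an allowed category.
    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("环保部 2012"));
        System.out.println(isAllowed("环保部!"));
    }
}
```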

If you want to identify and separate words in a sequence of ideographs, you will need to look at a more sophisticated API. I would start with the BreakIterator type.
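
A minimal sketch of walking word boundaries with java.text.BreakIterator is shown here; the class and the segments helper are illustrative names. How a run of ideographs is split depends on the rules built into the JDK in use, so the output for Chinese text should be inspected rather than assumed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakDemo {
    // Hypothetical helper: split text at the word boundaries reported by
    // BreakIterator. Whitespace and punctuation come back as their own segments.
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(segments("hello world", Locale.ENGLISH));
        System.out.println(segments("环保部", Locale.CHINESE));
    }
}
```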

+3
Jun 05 2018-12-12T00: