What you are looking for are Unicode properties.
For example, \p{L} matches any letter from any language.
So a regular expression that matches such a Chinese word could be something like
\p{L}+
There are many such properties. For more details, see regular-expressions.info.
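As a minimal sketch of how \p{L}+ could be used from Java: the class name UnicodeLetters, the helper findWords, and the sample text are my own illustrative choices, not anything from the question. The Chinese characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeLetters {
    // \p{L}+ matches one or more consecutive letters from any script.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: collects every run of letters found in the text.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = LETTERS.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // "\u4F60\u597D" is 你好 and "\u4E16\u754C" is 世界.
        String text = "Hello \u4F60\u597D, world \u4E16\u754C!";
        // Finds four runs of letters: Hello, 你好, world, 世界.
        System.out.println(findWords(text));
    }
}
```

Note that \p{L} only covers letters, so digits, punctuation and whitespace act as separators here.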
Another option is to use a modifier
Pattern.UNICODE_CHARACTER_CLASS
Java 7 introduced a new flag, Pattern.UNICODE_CHARACTER_CLASS, that enables the Unicode version of the predefined character classes. See my answer here for more details and links.
You can do something like this
Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
and \w will match all letters and all digits from any language (and, of course, some word-connecting characters like _).
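To make the difference visible, here is a small sketch that compares \w+ with and without the flag. The class and field names are illustrative assumptions, and the sample word is again written with Unicode escapes.

```java
import java.util.regex.Pattern;

class UnicodeWordClass {
    // Default behaviour: \w only covers [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // With the flag (Java 7+), \w follows the Unicode definition of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String word = "\u4F60\u597D"; // 你好
        System.out.println(ASCII_WORD.matcher(word).matches());   // false
        System.out.println(UNICODE_WORD.matcher(word).matches()); // true
    }
}
```

The same flag can also be switched on inside the pattern itself with the embedded flag expression (?U), e.g. "(?U)\\w+".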
stema Jun 05 2018-12-12T00: 00Z