Java Regular Expression does not recognize characters from other languages as word characters (i.e. \w)

Question

Let's say I have a word: "Aiavärav". The expression \w+ should match the whole word, but the letter "ä" cuts the word in half. Instead of Aiavärav, I get Aiav. What is the correct regular expression for words containing these non-ASCII letters?

+6

java regex parsing

jyriand Feb 09 '12 at 2:22

1 answer

ruakh · Accepted Answer · 2012-02-09T03:04:55+0000

According to the documentation, \w matches only [a-zA-Z_0-9] unless you specify the UNICODE_CHARACTER_CLASS flag:

 Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)

or embed (?U) in the pattern:

 Pattern.compile("(?U)\\w+")

Either of these requires JDK 1.7 (i.e. Java 7).
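
As a rough illustration of the difference, the sketch below runs the plain pattern and both Unicode-aware variants against a sample word. The class name UnicodeWordDemo, the helper firstMatch, and the sample word are made up for this example; only Pattern, Matcher, and the UNICODE_CHARACTER_CLASS flag come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class UnicodeWordDemo {
    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Sample word (an assumption); \u00e4 is the letter ä written as an escape.
        String word = "Aiav\u00e4rav";

        Pattern ascii = Pattern.compile("\\w+");
        Pattern flagged = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
        Pattern embedded = Pattern.compile("(?U)\\w+");

        // The ASCII-only \w stops in front of the ä, so only the first half matches.
        System.out.println(firstMatch(ascii, word));
        // With the flag (Java 7+), \w also covers non-ASCII letters: the whole word matches.
        System.out.println(firstMatch(flagged, word));
        // The embedded (?U) flag behaves the same way as the constant.
        System.out.println(firstMatch(embedded, word));
    }
}
```

The first call should give only the part before the ä, while the other two should give the whole word.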

If you do not have Java 7, you can generalize \w to Unicode using \p{L} ("letter"; like [a-zA-Z], but not ASCII-specific) and \p{N} ("number"; like [0-9], but not ASCII-specific):

 Pattern.compile("[\\p{L}_\\p{N}]+")

But it sounds like you are looking for actual words in the ordinary sense (as opposed to the programming-language sense) and do not need to support digits and underscores? In that case, you can simply use \p{L}:

 Pattern.compile("\\p{L}+")
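
To compare the two pre-Java 7 patterns side by side, here is a minimal sketch that collects every match from a sample string. The class LetterCategoryDemo, the helper findAll, and the sample text are assumptions made for illustration; the code avoids Java 7 syntax so it should also compile on older JDKs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class LetterCategoryDemo {
    // Collects every non-overlapping match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample text (an assumption): a word with ä, an underscore, digits, and a word with é.
        String text = "Aiav\u00e4rav_2 caf\u00e9 42";

        // Letters, underscore and numbers: a Unicode-aware stand-in for \w+.
        // Three matches: the first token including "_2", the second word, and "42".
        System.out.println(findAll("[\\p{L}_\\p{N}]+", text));

        // Letters only: "_2" and "42" are left out, so just the two words remain.
        System.out.println(findAll("\\p{L}+", text));
    }
}
```

Which of the two fits depends on whether identifiers such as name_2 should count as a single word.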

(By the way, the curly braces are actually optional for single-letter categories: you can write \pL instead of \p{L} and \pN instead of \p{N}. People usually include them anyway, because they are required for multi-letter categories such as \p{Lu}, "uppercase letter".)
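
A small sketch of the brace-less and multi-letter forms, assuming a hypothetical BraceFormsDemo class and sample inputs; Pattern.matches tests whether the entire input matches the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class BraceFormsDemo {
    // True if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String word = "Aiav\u00e4rav"; // sample word (an assumption)

        // \pL is the short form of \p{L}; both accept the whole word.
        System.out.println(matchesWhole("\\pL+", word));      // true
        System.out.println(matchesWhole("\\p{L}+", word));    // true

        // \pN is the short form of \p{N}.
        System.out.println(matchesWhole("\\pN+", "42"));      // true

        // Two-letter categories need the braces: one uppercase letter, then lowercase letters.
        System.out.println(matchesWhole("\\p{Lu}\\p{Ll}+", word)); // true
        // The word is not made up of uppercase letters only.
        System.out.println(matchesWhole("\\p{Lu}+", word));        // false
    }
}
```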
