According to the documentation, \w matches only [a-zA-Z_0-9] unless you specify the UNICODE_CHARACTER_CLASS flag:
Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
or embed (?U) in the pattern itself:
Pattern.compile("(?U)\\w+")
Either of these requires JDK 1.7 (i.e. Java 7).
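As a minimal sketch of the difference, the following compares the default \w with both Unicode-aware forms. The class name UnicodeWordDemo, the helper findAll, and the sample text are made up for illustration; only Pattern, Matcher, and the UNICODE_CHARACTER_CLASS flag come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; requires Java 7 or later for UNICODE_CHARACTER_CLASS.
class UnicodeWordDemo {
    // Collects every match of the pattern in the input, in order of appearance.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "naïve café", written with escapes so the source file encoding does not matter.
        String text = "na\u00EFve caf\u00E9";

        // Default: \w is ASCII-only, so the accented letters split the words.
        System.out.println(findAll(Pattern.compile("\\w+"), text));

        // With the flag: \w follows the Unicode definition of a word character.
        System.out.println(findAll(Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS), text));

        // Same effect with the embedded flag.
        System.out.println(findAll(Pattern.compile("(?U)\\w+"), text));
    }
}
```

With the default pattern the accented letters act as word breaks, while both flagged variants should return the two words whole.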
If you do not have Java 7, you can generalize \w to Unicode using \p{L} ("letter"; like [a-zA-Z], but not ASCII-specific) and \p{N} ("number"; like [0-9], but not ASCII-specific):
Pattern.compile("[\\p{L}_\\p{N}]+")
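A short sketch of that fallback, written without any Java 7 features. The class name LetterNumberDemo, the helper findAll, and the sample string are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses only features available before Java 7.
class LetterNumberDemo {
    // Unicode letters, the underscore, and Unicode numbers: a hand-built stand-in for \w.
    static final Pattern WORD = Pattern.compile("[\\p{L}_\\p{N}]+");

    // Collects every match of WORD in the input, in order of appearance.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "Größe_42 ok", written with escapes so the source file encoding does not matter.
        System.out.println(findAll("Gr\u00F6\u00DFe_42 ok"));
    }
}
```

Here the non-ASCII letters, the underscore, and the digits all stay inside a single match, which plain \w would not manage without the Unicode flag.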
But it looks like you are after real words in the ordinary sense (as opposed to the programming-language sense) and do not need support for digits and underscores? In that case, you can simply use \p{L}:
Pattern.compile("\\p{L}+")
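For illustration, a sketch of how the letters-only pattern splits a mixed string. The class name LettersOnlyDemo, the helper words, and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the letters-only pattern.
class LettersOnlyDemo {
    // One or more Unicode letters; digits and underscores are not part of a word here.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Returns each run of letters found in the input, in order of appearance.
    static List<String> words(String input) {
        List<String> result = new ArrayList<String>();
        Matcher m = LETTERS.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "café no. 42 foo_bar": the digits are skipped and the underscore splits foo_bar.
        System.out.println(words("caf\u00E9 no. 42 foo_bar"));
    }
}
```

Note that an identifier such as foo_bar comes back as two words, which is usually what you want for natural-language text but not for source code.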
(By the way, the curly braces are actually optional: you can write \pL instead of \p{L} and \pN instead of \p{N}, but people usually include them anyway because they are required for multi-letter categories such as \p{Lu}, "uppercase letter".)
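A small sketch of the brace rules. The class name CategorySyntaxDemo and the helper fullMatch are invented for illustration; the last two checks show what happens when the braces are left off a two-letter category.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing braced and unbraced category syntax.
class CategorySyntaxDemo {
    // True when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "caf\u00E9";  // "café"

        // Single-letter categories work with or without braces.
        System.out.println(fullMatch("\\pL+", word));      // true
        System.out.println(fullMatch("\\p{L}+", word));    // true

        // Two-letter categories need the braces.
        System.out.println(fullMatch("\\p{Lu}", "\u00C9")); // true:  uppercase E with acute
        System.out.println(fullMatch("\\p{Lu}", "\u00E9")); // false: lowercase e with acute

        // Without braces, \pLu is read as \pL followed by a literal 'u'.
        System.out.println(fullMatch("\\pLu", "xu"));       // true
        System.out.println(fullMatch("\\pLu", "\u00C9"));   // false
    }
}
```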