According to the documentation, \w matches only [a-zA-Z_0-9] unless you specify the UNICODE_CHARACTER_CLASS flag:
Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
or embed (?U) in the pattern itself:
Pattern.compile("(?U)\\w+")
Either of these requires JDK 1.7 (i.e. Java 7) or later.
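As a quick way to see the difference, here is a minimal sketch comparing the default \w with the two Unicode-aware variants. The class name WordCharDemo, the helper isWord, and the sample word are my own choices for illustration, and it assumes Java 7 or later:

```java
import java.util.regex.Pattern;

class WordCharDemo {
    // Default \w: ASCII letters, digits and underscore only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // Same pattern with the flag passed to compile() (Java 7+).
    static final Pattern FLAG_WORD = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    // Same pattern with the embedded (?U) flag (Java 7+).
    static final Pattern INLINE_WORD = Pattern.compile("(?U)\\w+");

    // True if the entire input is matched by the given pattern.
    static boolean isWord(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "\u00fcber"; // sample word starting with a non-ASCII letter (u-umlaut)
        System.out.println("default \\w:  " + isWord(ASCII_WORD, word));
        System.out.println("flag:        " + isWord(FLAG_WORD, word));
        System.out.println("inline (?U): " + isWord(INLINE_WORD, word));
    }
}
```

The default pattern should reject the sample word because of the non-ASCII first letter, while both flagged variants should accept it.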
If you do not have Java 7, you can generalize \w to Unicode using \p{L} ("letter"; like [a-zA-Z], but not ASCII-specific) and \p{N} ("number"; like [0-9], but not ASCII-specific):
Pattern.compile("[\\p{L}_\\p{N}]+")
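For example, a small sketch that pulls such tokens out of a string might look like this. The class name UnicodeWordTokens, the method tokens, and the sample text are made up for illustration; it sticks to pre-Java 7 syntax on purpose:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordTokens {
    // Letters, underscore and numbers from any script; no Java 7 flag needed.
    static final Pattern WORD = Pattern.compile("[\\p{L}_\\p{N}]+");

    // Collects every maximal run of word characters in the text.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample text with accented Latin letters, an underscore, digits and Cyrillic.
        System.out.println(tokens("na\u00efve caf\u00e9_42, \u0434\u0430!"));
    }
}
```

Note that \p{L} does not cover combining marks, so this assumes the accented letters are in precomposed form.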
But it sounds like you are looking for real words in the everyday sense (as opposed to the programming-language sense) and do not need digits or underscores. In that case, you can simply use \p{L}:
Pattern.compile("\\p{L}+")
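To illustrate what the letters-only pattern does and does not pick up, here is a short sketch. LettersOnlyWords, words, and the sample sentence are hypothetical names and data of mine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LettersOnlyWords {
    // Runs of letters only: digits and underscores act as separators.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Collects every maximal run of letters in the text.
    static List<String> words(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = LETTERS.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The digits are dropped and "cats_and" is split at the underscore.
        System.out.println(words("Zo\u00eb has 2 cats_and 1 dog"));
    }
}
```

If splitting at underscores is not what you want, the broader character class with _ and \p{N} is the safer choice.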
(By the way, the curly braces are actually optional for single-letter categories: you can write \pL instead of \p{L} and \pN instead of \p{N}. People usually include them anyway because they are required for multi-letter categories such as \p{Lu}, "uppercase letter".)
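A minimal sketch of the two spellings side by side, plus a multi-letter category. The class name PropertyShorthand, the constant names, and the test words are mine, chosen only for illustration:

```java
import java.util.regex.Pattern;

class PropertyShorthand {
    // Braced and brace-less spellings of the single-letter category L.
    static final Pattern BRACED = Pattern.compile("\\p{L}+");
    static final Pattern SHORT = Pattern.compile("\\pL+");
    // Multi-letter category: braces are required here.
    static final Pattern CAPITALIZED = Pattern.compile("\\p{Lu}\\p{L}*");

    public static void main(String[] args) {
        String lower = "\u00e9cole"; // starts with a lowercase accented letter
        String upper = "\u00c9cole"; // starts with an uppercase accented letter
        System.out.println(BRACED.matcher(lower).matches() == SHORT.matcher(lower).matches());
        System.out.println(CAPITALIZED.matcher(upper).matches());
        System.out.println(CAPITALIZED.matcher(lower).matches());
    }
}
```

Both spellings should behave identically; the capitalized pattern should accept only the word that starts with an uppercase letter.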