Java Regular Expression does not recognize characters from other languages as word characters (i.e. \w)

Let's say I have a word: "Ayavarav." The expression \w+ should match the whole word, but the letter "ä" cuts it in half, so instead of the full word I get only the first part. What is the correct regular expression for words containing such non-ASCII letters?

1 answer

According to the documentation, \w matches only [a-zA-Z_0-9] unless you specify the UNICODE_CHARACTER_CLASS flag:

 Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS) 

or embed the (?U) flag in the pattern itself:

 Pattern.compile("(?U)\\w+") 

Either of these requires JDK 1.7 (i.e. Java 7).
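As a quick check of the behavior described above, here is a minimal sketch that compares the default \w+ with the two Unicode-aware forms. The class name UnicodeWordDemo, the helper firstMatch, and the sample word are assumptions made for illustration; the "ä" is written as the escape \u00e4 so the result does not depend on the source file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Returns the first substring the pattern finds in the input, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample word containing a-umlaut (U+00E4).
        String word = "M\u00e4dchen";

        Pattern ascii = Pattern.compile("\\w+");
        Pattern withFlag = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
        Pattern withInlineFlag = Pattern.compile("(?U)\\w+");

        // Default \w is ASCII-only, so the match stops before the umlaut: only "M" is found.
        System.out.println(firstMatch(ascii, word));
        // Both Unicode-aware variants match the whole word.
        System.out.println(firstMatch(withFlag, word));
        System.out.println(firstMatch(withInlineFlag, word));
    }
}
```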

If you do not have Java 7, you can generalize \w to Unicode using \p{L} ("letter"; like [a-zA-Z], but not ASCII-specific) and \p{N} ("number"; like [0-9], but not ASCII-specific):

 Pattern.compile("[\\p{L}_\\p{N}]+") 

But it sounds like you are looking for real words in the everyday sense (as opposed to the programming-language sense) and do not need to support digits and underscores. In that case, you can simply use \p{L}:

 Pattern.compile("\\p{L}+") 
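To make the difference between the two category-based patterns concrete, here is a small sketch that collects all matches from a sample string. The class name UnicodeCategoryDemo, the helper allMatches, and the sample text are assumptions made for illustration. The code avoids Java 7 syntax on purpose, since this is the pre-Java 7 workaround.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeCategoryDemo {
    // Collects every non-overlapping match of the regex in the input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "cafe" with e-acute, an underscore and a digit,
        // then "naive" with i-diaeresis.
        String text = "caf\u00e9_2 na\u00efve";

        // Letters, underscore and numbers: finds two tokens, the first including "_2".
        System.out.println(allMatches("[\\p{L}_\\p{N}]+", text));
        // Letters only: the first token ends before the underscore.
        System.out.println(allMatches("\\p{L}+", text));
    }
}
```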

(By the way, the curly braces are actually optional here: you can write \pL instead of \p{L} and \pN instead of \p{N}, but people usually include them anyway because they are required for multi-letter categories such as \p{Lu}, "uppercase letter".)
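The brace-less and braced spellings can be compared directly. The following is only a sketch; the class name ShortCategoryDemo, the helper matchesWhole, and the sample word are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class ShortCategoryDemo {
    // True if the regex matches the entire input string.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Hypothetical sample word starting with uppercase A-umlaut (U+00C4).
        String word = "\u00c4rger";

        // Single-letter category: with and without braces behave the same.
        System.out.println(matchesWhole("\\pL+", word));           // true
        System.out.println(matchesWhole("\\p{L}+", word));         // true
        // Two-letter categories need the braces: one uppercase letter, then lowercase letters.
        System.out.println(matchesWhole("\\p{Lu}\\p{Ll}+", word)); // true
        // The word is not all uppercase, so this does not match.
        System.out.println(matchesWhole("\\p{Lu}+", word));        // false
    }
}
```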

Source: https://habr.com/ru/post/908000/