Positive lookbehind for non-ASCII characters in R

I have an R function that capitalizes the first letter of each word:

proper = function(x){
  gsub("(?<=\\b)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

This works very well, but when a word contains a letter with a Māori macron, for example Māori, the wrong letter gets capitalized.

> proper("Māori")
[1] "MāOri"

Apparently, the regex engine treats the macron letter ā as a word boundary. I do not know why.

+4
2 answers

Since you are using the PCRE regex engine (enabled with perl=TRUE), you need to put the (*UCP) verb at the start of the pattern so that all shorthand classes and word boundaries match the correct characters and positions in Unicode text:

proper = function(x){
  gsub("(*UCP)\\b([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}
proper("Māori")
## [1] "Māori"


Also, there is no need to wrap \b in a lookbehind: it is already a zero-width assertion, so (?<=\b) is equivalent to \b.
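As a rough illustration of both points, the sketch below applies the (*UCP) pattern to a couple of multi-word strings and checks that the lookbehind form gives the same result. The helper names proper_ucp and proper_lookbehind and the sample strings are made up for this example; the \u0101 escape is used for ā so that the strings are UTF-8 regardless of the file encoding.

```r
# Hypothetical helpers for illustration; only the patterns come from the thread.
proper_ucp <- function(x) {
  # (*UCP) makes \b and [[:alpha:]] use Unicode properties
  gsub("(*UCP)\\b([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

proper_lookbehind <- function(x) {
  # Same pattern, but with \b wrapped in a (redundant) lookbehind
  gsub("(*UCP)(?<=\\b)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

# "\u0101" is the letter a with macron
x <- c("m\u0101ori language", "te reo m\u0101ori")

res_plain      <- proper_ucp(x)
res_lookbehind <- proper_lookbehind(x)

# Both forms should agree on every element
same <- identical(res_plain, res_lookbehind)
print(res_plain)
print(same)
```

If the two results ever differed, that would contradict the claim that \b needs no lookbehind, so the comparison doubles as a small sanity check.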

+3

\b is defined in terms of word characters, which by default are only [a-zA-Z0-9_], so it is not Unicode-aware.
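The ASCII-only default can be checked directly. This is a minimal sketch, assuming a UTF-8 input string (written here with the \u0101 escape for ā) and perl = TRUE; the variable names are illustrative only.

```r
a_macron <- "\u0101"  # the letter a with macron

# Without (*UCP), \w is expected not to match the macron letter ...
is_word_default <- grepl("^\\w$", a_macron, perl = TRUE)

# ... while with (*UCP) it should be treated as a word character.
is_word_ucp <- grepl("(*UCP)^\\w$", a_macron, perl = TRUE)

print(c(default = is_word_default, ucp = is_word_ucp))
```

Because ā is not a word character by default, \b sees a boundary between ā and the following letter, which is where the stray capital in "MāOri" comes from.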

, gsub R , .

An alternative pattern:

(?<!\\S)([[:alpha:]])

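This only matches a letter at the start of the string or right after whitespace, so it no longer capitalizes a letter in the middle of a word such as āmori.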
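For comparison, here is a sketch of the whitespace-based pattern wrapped in a function. The name proper_ws and the sample inputs are assumptions made for this example, not part of the original answer.

```r
# Hypothetical wrapper around the whitespace-lookbehind pattern
proper_ws <- function(x) {
  # (?<!\S) succeeds only at the start of the string or after whitespace
  gsub("(?<!\\S)([[:alpha:]])", "\\U\\1", x, perl = TRUE)
}

res_words  <- proper_ws("m\u0101ori language")
res_parens <- proper_ws("(abc)")  # "(" is not whitespace, so nothing changes

print(res_words)
print(res_parens)
```

The trade-off is that a word following punctuation, such as an opening parenthesis or a hyphen, is left alone, whereas the \b version with (*UCP) would capitalize it.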

0

Source: https://habr.com/ru/post/1690165/