Decapitalize people's names (accounting for ' and -)

I have a vector of (human) names, all in capitals:

names <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN") 

To decapitalize (keep only the first letter of each word in upper case), I used

    simpleDecap <- function(x) {
      s <- strsplit(x, " ")[[1]]
      paste0(substring(s, 1, 1), tolower(substring(s, 2)), collapse = " ")
    }
    sapply(names, simpleDecap, USE.NAMES = FALSE)
    # [1] "Friedrich Schiller" "Frank O'hara" "Hans-christian Andersen"

But I also want to account for ' and -. Using s <- strsplit(x, " |\\'|\\-")[[1]], of course, finds the right letters, but then the ' and - are lost when the pieces are collapsed back together. Therefore, I tried

    simpleDecap2 <- function(x) {
      for (char in c(" ", "\\-", "\\'")) {
        s <- strsplit(x, char)[[1]]
        x <- paste0(substring(s, 1, 1), tolower(substring(s, 2)), collapse = char)
      }
      return(x)
    }


but this is even worse, of course, since the string is split on one separator at a time and each pass lowercases what the previous pass had capitalized:

    sapply(names, simpleDecap2, USE.NAMES = FALSE)
    # [1] "Friedrich schiller" "Frank o'Hara" "Hans-christian andersen"

I think the right approach is to split with s <- strsplit(x, " |\\'|\\-")[[1]], but then the paste step is the problem.

2 answers

This seems to work, using Perl-compatible regular expressions:

 gsub("\\b(\\w)([\\w]+)", "\\1\\L\\2", names, perl = TRUE) 

\L converts the text that follows it in the replacement, here the second capture group, to lower case.
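To make the effect of \L concrete, here is a small sketch that runs the call above on the vector from the question. The variable name `people` is my own choice (it avoids masking base R's names() function), and the second call is a variant that is not part of the answer: it assumes you would rather lowercase everything first and then uppercase the first letter of each word with \U, which should also cope with mixed-case input.

```r
# Input from the question; `people` is a hypothetical name used here
# instead of `names` so that base::names() is not masked.
people <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")

# Pattern from the answer: \\1 keeps the first letter of each word as is,
# \\L lowercases whatever follows it in the replacement (here \\2).
decap <- gsub("\\b(\\w)([\\w]+)", "\\1\\L\\2", people, perl = TRUE)
# decap is c("Friedrich Schiller", "Frank O'Hara", "Hans-Christian Andersen")

# Variant (an assumption, not from the answer): lowercase first, then
# uppercase the letter right after each word boundary with \\U.
mixed <- c("frank o'hara", "hANS-cHRISTIAN andersen")
decap_any <- gsub("\\b(\\w)", "\\U\\1", tolower(mixed), perl = TRUE)
# decap_any is c("Frank O'Hara", "Hans-Christian Andersen")
```

Both calls are vectorized over the input, so no sapply() is needed.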


Although I agree that the Perl regexp is the better solution, your simpleDecap2 approach is not that far from working.

    simpleDecap3 <- function(x) {
      x <- tolower(x)
      for (char in c(" ", "-", "'")) {
        s <- strsplit(x, char)[[1]]
        x <- paste0(toupper(substring(s, 1, 1)), substring(s, 2), collapse = char)
      }
      x
    }

That is, turn the whole name to lower case, and then capitalize the first letter after " ", "-" or "'". It is not as pretty as the regular expression, and most likely not as robust, but it gets the job done with only minor changes to your original code.
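As a quick check, the function can be applied to the question's vector one element at a time, since the [[1]] inside it means it handles a single string per call. This is only a sketch: the name `people` and the use of vapply() instead of sapply() are my own choices, not part of the answer.

```r
# simpleDecap3 as given in the answer above
simpleDecap3 <- function(x) {
  x <- tolower(x)
  for (char in c(" ", "-", "'")) {
    s <- strsplit(x, char)[[1]]
    x <- paste0(toupper(substring(s, 1, 1)), substring(s, 2), collapse = char)
  }
  x
}

# Input from the question; `people` is a hypothetical name used here
# instead of `names` so that base::names() is not masked.
people <- c("FRIEDRICH SCHILLER", "FRANK O'HARA", "HANS-CHRISTIAN ANDERSEN")

# vapply() applies the function to each name and checks that every
# result is a single character string.
res <- vapply(people, simpleDecap3, character(1), USE.NAMES = FALSE)
# res is c("Friedrich Schiller", "Frank O'Hara", "Hans-Christian Andersen")
```

Because the three separators contain no regex metacharacters that need escaping here, they can be reused unchanged as the collapse string, which is what the escaped versions in simpleDecap2 could not do.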


Source: https://habr.com/ru/post/1232182/

