Convert certain words to lowercase in R

I need to convert certain words to lowercase. I work with a list of movie titles, where prepositions and articles are usually lowercase unless they are the first word in the title. If I have a vector:

movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')

I need this:

movies_updated = c('The Kings of Summer', 'The Words', 'Out of the Furnace', 'Me and Earl and the Dying Girl')

Is there an elegant way to do this without a long series of gsub() calls, as in:

    movies_updated = gsub(' In ', ' in ', movies)
    movies_updated = gsub(' In', ' in', movies_updated)
    movies_updated = gsub(' Of ', ' of ', movies_updated)
    movies_updated = gsub(' Of', ' of', movies_updated)
    movies_updated = gsub(' The ', ' the ', movies_updated)
    movies_updated = gsub(' The', ' the', movies_updated)

And so on.

3 answers

It seems that what you actually want is to convert the text to title case. This can be done with the stringi package, although stri_trans_totitle capitalizes every word, including articles and prepositions:

    > stringi::stri_trans_totitle(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
    [1] "The Kings Of Summer" "The Words" "Out Of The Furnace"

An alternative approach uses the toTitleCase function from the tools package, which leaves articles and prepositions in lowercase:

    > tools::toTitleCase(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
    [1] "The Kings of Summer" "The Words" "Out of the Furnace"

Although I like @Konrad's answer for its brevity, I will offer an alternative that is more literal and manual.

    movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
    gr <- gregexpr("(?<!^)\\b(of|in|the)\\b", movies, ignore.case = TRUE, perl = TRUE)
    mat <- regmatches(movies, gr)
    regmatches(movies, gr) <- lapply(mat, tolower)
    movies
    # [1] "The Kings of Summer" "The Words"
    # [3] "Out of the Furnace" "Me And Earl And the Dying Girl"

Regular expression tricks:

  • (?<!^) ensures that we do not match a word at the beginning of the string. Without it, the leading The in titles 1 and 2 would be lowercased as well.
  • \\b marks a word boundary, so the in inside Dying will not match. This is a little more robust than matching spaces, because hyphens, commas, etc. are not spaces but still mark the beginning or end of a word.
  • (of|in|the) matches any of of, in, or the. Additional words can be added, separated by pipes |.

Once the matches are identified, it is as simple as replacing them with their lowercase versions.
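The steps above can be wrapped in a small helper so that the word list is passed as a vector instead of being hard-coded in the pattern. This is only a sketch: the name lower_small_words and its default word list are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper (name and defaults are illustrative assumptions):
# lowercase the given words everywhere except at the start of each title.
lower_small_words <- function(titles, words = c("of", "in", "the", "and")) {
  # Builds e.g. "(?<!^)\\b(of|in|the|and)\\b" from the word vector.
  pattern <- paste0("(?<!^)\\b(", paste(words, collapse = "|"), ")\\b")
  # Locate every match, then overwrite the matches with their lowercase form.
  gr <- gregexpr(pattern, titles, ignore.case = TRUE, perl = TRUE)
  regmatches(titles, gr) <- lapply(regmatches(titles, gr), tolower)
  titles
}

movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')
movies_updated <- lower_small_words(movies)
```

Adding "and" to the word list is what makes the fourth title come out as requested in the question; with only of, in, and the it would keep its capitalized And.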


Another example of how to turn certain words into lowercase with gsub (with the PCRE regular expression):

    movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
    gsub("(?!^)\\b(Of|In|The)\\b", "\\L\\1", movies, perl=TRUE)


More details

  • (?!^) - not at the beginning of the string (it doesn't matter whether we use a lookahead or a lookbehind here, since the pattern inside it is a zero-width assertion)
  • \\b - match a leading word boundary
  • (Of|In|The) - capture Of, In, or The into group 1
  • \\b - make sure that a trailing word boundary follows.

The replacement contains the lowercasing operator \L, which turns all characters of the first backreference (the text captured in group 1) into lowercase.
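The remark that a lookahead and a lookbehind are interchangeable around ^ can be checked directly. The following is a minimal sketch using the same movies vector; the variable names are chosen here for illustration only.

```r
movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')

# Same pattern twice, once with a negative lookahead and once with a
# negative lookbehind in front of the word boundary.
pattern_ahead  <- "(?!^)\\b(Of|In|The)\\b"
pattern_behind <- "(?<!^)\\b(Of|In|The)\\b"

# "\\L\\1" lowercases the text captured in group 1 (requires perl = TRUE).
with_lookahead  <- gsub(pattern_ahead,  "\\L\\1", movies, perl = TRUE)
with_lookbehind <- gsub(pattern_behind, "\\L\\1", movies, perl = TRUE)

# Should be TRUE if both assertions behave the same way.
same_result <- identical(with_lookahead, with_lookbehind)
```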

Note that this may be more flexible than using tools::toTitleCase. Here is the part of its code that contains the specific lowercase words:

    ## These should be lower case except at the beginning (and after :)
    lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

If you only need the lowercasing and do not care about the other logic in that function, it may be enough to add these alternatives (without the ^ and $ anchors) to the regular expression at the top of this answer.
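As a rough sketch of that suggestion, the alternation can be assembled from a character vector and passed to gsub. The vector small_words below is an assumption: it is only a hand-picked subset of the words in lpat, and ignore.case = TRUE is added here so that any capitalization of those words is matched.

```r
movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')

# Illustrative subset of the words listed in lpat (not the full list).
small_words <- c("a", "an", "and", "as", "at", "but", "by", "for",
                 "in", "of", "on", "or", "the", "to")

# No ^ and $ anchors: the words are wrapped in word boundaries instead,
# and (?!^) keeps the first word of each title untouched.
pattern <- paste0("(?!^)\\b(", paste(small_words, collapse = "|"), ")\\b")

movies_updated <- gsub(pattern, "\\L\\1", movies, perl = TRUE, ignore.case = TRUE)
```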


Source: https://habr.com/ru/post/1266252/
