Extract a string of words between two specific words in R

I have the following string: "PRODUCT colgate good but not goodOKAY"

I want to extract all the words between PRODUCT and OKAY.

5 answers

This can be done using sub:

 s <- "PRODUCT colgate good but not goodOKAY"
 sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)

giving:

 [1] "colgate good but not good" 

No packages are needed.
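One caveat with sub is that it returns the input unchanged when the pattern does not match. A minimal sketch of a wrapper that returns NA in that case follows; the helper name extract_between and its arguments are hypothetical, not part of the answer above, and it assumes the two marker words contain no regex metacharacters.

```r
# Hypothetical helper (illustration only): wraps the sub() call above and
# returns NA for strings that do not contain both markers.
# Assumes `left` and `right` contain no regex metacharacters.
extract_between <- function(x, left = "PRODUCT", right = "OKAY") {
  pattern <- paste0(".*", left, " *(.*?) *", right, ".*")
  # sub() leaves non-matching strings untouched, so test for a match first.
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

s <- c("PRODUCT colgate good but not goodOKAY", "no markers here")
extract_between(s)
# [1] "colgate good but not good" NA
```

Both grepl and sub are vectorized, so the same call works on a whole character vector or data frame column.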

The regular expression used above:

 .*PRODUCT *(.*?) *OKAY.* 

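You can also use str_extract from the stringr package with lookarounds:

 library(stringr)
 x <- "PRODUCT colgate good but not goodOKAY"
 # Lookarounds work with stringr's default regex engine; no perl() wrapper is needed
 str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")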

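(?<=PRODUCT) - a lookbehind: the match must be preceded by PRODUCT.

.* - matches any characters except newlines.

(?=OKAY) - a lookahead: the match must be followed by OKAY.

Note that the match keeps the space that follows PRODUCT; wrap the call in str_trim() to drop it.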

I must add that you do not need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistent syntax: whether I am extracting, replacing, or detecting, the function names are predictable and understandable, and the arguments come in a consistent order. It saves me from going back to the documentation every time.
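For comparison, the same lookaround pattern can be used without stringr. This is a sketch of one base R way to do it, not the answer author's code; it relies on regexpr with perl = TRUE together with regmatches.

```r
x <- "PRODUCT colgate good but not goodOKAY"

# perl = TRUE is required: R's default regex engine does not support lookarounds.
m <- regexpr("(?<=PRODUCT).*(?=OKAY)", x, perl = TRUE)

# regmatches() returns the matched substring;
# trimws() drops the space that follows PRODUCT.
result <- trimws(regmatches(x, m))
result
# [1] "colgate good but not good"
```

Keep in mind that regmatches silently drops elements that have no match, so on a vector the result can be shorter than the input.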


You can use gsub:

 vec <- "PRODUCT colgate good but not goodOKAY"
 gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
 # [1] "colgate good but not good"

You can use the rm_between function from the qdapRegex package. It takes the string and the left and right boundaries, as follows:

 library(qdapRegex)
 x <- "PRODUCT colgate good but not goodOKAY"
 rm_between(x, "PRODUCT", "OKAY", extract=TRUE)
 ## [[1]]
 ## [1] "colgate good but not good"

You can use the unglue package:

 library(unglue)
 x <- "PRODUCT colgate good but not goodOKAY"
 unglue_vec(x, "PRODUCT {out}OKAY")
 #> [1] "colgate good but not good"

Source: https://habr.com/ru/post/981842/

