Assigning groups using grepl with multiple inputs

Question

I have a dataframe:

    df <- data.frame(name=c("john", "david", "callum", "joanna", "allison", "slocum", "lisa"), id=1:7)
    df
         name id
    1    john  1
    2   david  2
    3  callum  3
    4  joanna  4
    5 allison  5
    6  slocum  6
    7    lisa  7

I have a vector of regular expressions that I want to match against the df$name variable:

 vec <- c("lis", "^jo", "um$")

The output I want to get is as follows:

         name id group
    1    john  1     2
    2   david  2    NA
    3  callum  3     3
    4  joanna  4     2
    5 allison  5     1
    6  slocum  6     3
    7    lisa  7     1

I could do this by doing the following:

    df$group <- ifelse(grepl("lis", df$name), 1,
                ifelse(grepl("^jo", df$name), 2,
                ifelse(grepl("um$", df$name), 3, NA)))

However, I want to do this directly from vec, because I am generating the values in vec reactively in a Shiny application. Can I assign groups by index in vec?

Further, if a name matches more than one pattern, as with the vector below, it should get the group of the first matching pattern. For example, "callum" is TRUE for both "all" and "um$", but should get group 1 here.

 vec <- c("all", "^jo", "um$")

regex r

jalapic Jan 10 '16 at 6:49

2 answers

A vectorized solution using rebus and stringi.

    library(rebus)
    library(stringi)

Create a regex that captures any of the values in vec.

    vec <- c("lis", "^jo", "um$")
    (rx <- or1(vec, capture = TRUE))
    ## <regex> (lis|^jo|um$)

Match the regular expression, then convert the matches to a factor and then to an integer.

    matches <- stri_match_first_regex(df$name, rx)[, 2]
    df$group <- as.integer(factor(matches, levels = c("lis", "jo", "um")))

df now looks like this:

         name id group
    1    john  1     2
    2   david  2    NA
    3  callum  3     3
    4  joanna  4     2
    5 allison  5     1
    6  slocum  6     3
    7    lisa  7     1
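The levels in the factor() call above are typed out by hand, which does not help when vec is generated at run time. As a rough sketch only, and assuming every pattern is literal text with at most a leading ^ or trailing $ anchor, the levels could be derived from vec itself. This version swaps in base R (regexpr and regmatches) for rebus and stringi; the names m and lvls are illustrative and not part of the answer.

```r
df <- data.frame(name = c("john", "david", "callum", "joanna",
                          "allison", "slocum", "lisa"),
                 id = 1:7, stringsAsFactors = FALSE)
vec <- c("lis", "^jo", "um$")

# Build one alternation from the patterns: "(lis|^jo|um$)"
rx <- paste0("(", paste(vec, collapse = "|"), ")")

# regexpr() returns -1 for strings where nothing matched
m <- regexpr(rx, df$name, perl = TRUE)
matches <- rep(NA_character_, nrow(df))
matches[m > 0] <- regmatches(df$name, m)

# Assumption: patterns are literal text plus optional ^ / $ anchors,
# so stripping the anchors leaves the text each pattern can match
lvls <- sub("\\$$", "", sub("^\\^", "", vec))

df$group <- as.integer(factor(matches, levels = lvls))
```

Keep in mind that an alternation reports the leftmost match in each string, which is not necessarily the earliest pattern in vec, so this approach does not by itself guarantee the "first pattern wins" rule from the question.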

Richie Cotton Jan 10 '16 at 10:27

Jota · Accepted Answer · 2016-01-10T07:18:34+0000

Here are a few options:

    df$group <- apply(Vectorize(grepl, "pattern")(vec, df$name), 1,
                      function(ii) which(ii)[1])
    #     name id group
    #1    john  1     2
    #2   david  2    NA
    #3  callum  3     3
    #4  joanna  4     2
    #5 allison  5     1
    #6  slocum  6     3
    #7    lisa  7     1
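To see why which(ii)[1] handles both the "no match" and the "several matches" cases, it can help to look at the intermediate logical matrix. The sketch below builds the same kind of matrix with sapply instead of Vectorize (a substitution made here only for readability) and uses the second vec from the question; hits is an illustrative name.

```r
df <- data.frame(name = c("john", "david", "callum", "joanna",
                          "allison", "slocum", "lisa"),
                 id = 1:7, stringsAsFactors = FALSE)
vec <- c("all", "^jo", "um$")

# One row per name, one column per pattern
hits <- sapply(vec, grepl, x = df$name)

hits[3, ]            # "callum": TRUE for "all" and for "um$"
which(hits[3, ])[1]  # position of the first TRUE, so group 1
which(hits[2, ])[1]  # "david": no TRUE in the row, so NA

# Same row-wise step as in the one-liner
df$group <- apply(hits, 1, function(ii) which(ii)[1])
```

Because which() returns the TRUE positions in column order, taking the first element picks the earliest pattern in vec, and indexing an empty result with [1] yields NA.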

Use a named vector and merge it:

    names(vec) <- seq_along(vec)
    df <- merge(df,
                stack(Vectorize(grep, "pattern", SIMPLIFY=FALSE)(vec, df$name)),
                by.x="id", by.y="values", all.x = TRUE)
    df[!duplicated(df$id),] # to keep only the first match
    #  id    name  ind
    #1  1    john    2
    #2  2   david <NA>
    #3  3  callum    3
    #4  4  joanna    2
    #5  5 allison    1
    #6  6  slocum    3
    #7  7    lisa    1

A for loop:

    df$group <- NA
    for (i in rev(seq_along(vec))) {
      TFvec <- grepl(vec[i], df$name)
      df$group[TFvec] <- i
    }
    df
    #     name id group
    #1    john  1     2
    #2   david  2    NA
    #3  callum  3     3
    #4  joanna  4     2
    #5 allison  5     1
    #6  slocum  6     3
    #7    lisa  7     1
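Since vec changes at run time, the loop can be wrapped in a small function that takes the names and whatever patterns are current. This is only a sketch; first_match_group is a hypothetical name, not a function from base R or Shiny.

```r
# Hypothetical helper: for each element of `x`, the index of the first
# pattern in `patterns` that matches it, or NA if none match
first_match_group <- function(x, patterns) {
  group <- rep(NA_integer_, length(x))
  # Walk the patterns backwards so earlier patterns overwrite later ones
  for (i in rev(seq_along(patterns))) {
    group[grepl(patterns[i], x)] <- i
  }
  group
}

df <- data.frame(name = c("john", "david", "callum", "joanna",
                          "allison", "slocum", "lisa"),
                 id = 1:7, stringsAsFactors = FALSE)

# Second vec from the question: "callum" matches "all" and "um$"
df$group <- first_match_group(df$name, c("all", "^jo", "um$"))
```

Iterating in reverse is what implements the tie-break: a name that matches several patterns is written several times, and the last write comes from the lowest index.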

Or you can use outer with stri_match_first_regex from stringi:

    library(stringi)
    match.mat <- outer(df$name, vec, stri_match_first_regex)
    df$group <- apply(match.mat, 1, function(ii) which(!is.na(ii))[1]) # [1] for first match in `vec`
    #     name id group
    #1    john  1     2
    #2   david  2    NA
    #3  callum  3     3
    #4  joanna  4     2
    #5 allison  5     1
    #6  slocum  6     3
    #7    lisa  7     1
