Detect a consecutive run of certain letters in a string using R

Question

I would like to determine whether the string column of the data frame below contains a run of at least 5 consecutive letters "V" or "G" within the first 20 characters of each string.

Sample data:

data = data.frame(class = c('a','b','C'), string = c("ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ", "AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD", "GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER"))

For example, the string in the first row has "VVVVG" within the first 20 character positions. Similarly, the string in the third row has "VVGGV".

 data
 #   class string
 # 1 a     ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ
 # 2 b     AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD
 # 3 C     GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER

The desired result should look like this:

 #   class string                                                  result
 # 1 a     ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ TRUE
 # 2 b     AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD      FALSE
 # 3 C     GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER       TRUE

+6

r substr stringr

Veerendra gadekar Jun 04 '15 at 14:50

2 answers

Another option, without substr:

 within(data, result<-grepl('^.{,15}[VG]{5,}', string))
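To see why the leading anchor keeps the run inside the first 20 characters, here is a small sketch with two made-up strings (not from the question's data). The quantifier is written as `{0,15}`, which I am assuming behaves the same as the `{,15}` above: at most 15 arbitrary characters may come before the run, so a run of 5 can end no later than position 20.

```r
# Pattern with the lower bound spelled out; assumed equivalent to '^.{,15}[VG]{5,}'
pattern <- "^.{0,15}[VG]{5,}"

# Made-up test strings (illustration only)
run_ends_at_20 <- paste0(strrep("A", 15), "VVGGV", "AAAA")  # run occupies positions 16-20
run_ends_at_21 <- paste0(strrep("A", 16), "VVGGV", "AAAA")  # run occupies positions 17-21

inside  <- grepl(pattern, run_ends_at_20)
outside <- grepl(pattern, run_ends_at_21)
print(c(inside = inside, outside = outside))
```

The first string should match and the second should not, because its run spills past position 20.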

+4

Matthew plourde Jun 04 '15 at 15:05

Brodieg · Accepted Answer · 2015-06-04T14:56:15+0000

Similar to akrun's approach:

 transform(data, result=grepl("[VG]{5,}", substr(string, 1, 20)))

This gives:

   class string                                                  result
 1 a     ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ TRUE
 2 b     AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD      FALSE
 3 C     GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER       TRUE

Here we use grepl with a character class that matches either "G" or "V" ([VG]) repeated 5 or more times ({5,}), applied to the first 20 characters of each string as extracted by substr. transform simply creates a new data frame with added or modified columns.
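For comparison, the same check can be written without a regular expression, which makes the logic explicit: take the first 20 characters, mark which of them are "V" or "G", and look at the longest run. This is only a rough sketch; `longest_vg_run` and its arguments are names made up for illustration.

```r
string <- c(
  "ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ",
  "AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD",
  "GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER"
)

# Hypothetical helper: length of the longest run of allowed letters
# within the first n characters of a single string
longest_vg_run <- function(s, n = 20, allowed = c("V", "G")) {
  chars <- strsplit(substr(s, 1, n), "")[[1]]  # first n characters, one per element
  runs <- rle(chars %in% allowed)              # run-length encode the TRUE/FALSE flags
  hits <- runs$lengths[runs$values]            # keep only the runs of allowed letters
  if (length(hits) == 0) 0L else max(hits)
}

run_lengths <- vapply(string, longest_vg_run, integer(1), USE.NAMES = FALSE)
result <- run_lengths >= 5
print(data.frame(run_lengths, result))
```

This will be slower than grepl on large inputs, but it also reports how long the longest run is, which the regex version does not.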

EDIT: some benchmarks versus Matthew's creative answer:

 set.seed(1)
 string <- vapply(
   replicate(1e5, sample(c("V", "G", "A", "S"), sample(20:300, 1), replace = TRUE)),
   paste0, character(1L), collapse = ""
 )
 library(microbenchmark)
 microbenchmark(
   grepl("[VG]{5,}", substr(string, 1, 20)),
   grepl("^.{,15}[VG]{5,}", string),
   times = 10
 )

It produces:

 Unit: milliseconds
                                      expr      min       lq     mean
  grepl("[VG]{5,}", substr(string, 1, 20)) 131.6668 131.8343 133.6644
          grepl("^.{,15}[VG]{5,}", string) 299.7326 300.4416 302.5065

I wasn't quite sure what to expect, but I think this makes sense, since substr is a very cheap operation. The timings are very close when the run of 5 occurs near the front of the string.
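Since the benchmark only compares speed, a quick sanity check that the two expressions return the same answers may be useful. A sketch on a smaller random sample (1e3 strings rather than 1e5, my choice to keep it quick), with the anchored pattern written as `{0,15}` on the assumption that it behaves like `{,15}`:

```r
set.seed(1)

# Random strings of length 20 to 300 over a four-letter alphabet
strings <- vapply(
  replicate(1e3, sample(c("V", "G", "A", "S"), sample(20:300, 1), replace = TRUE),
            simplify = FALSE),
  paste0, character(1L), collapse = ""
)

via_substr <- grepl("[VG]{5,}", substr(strings, 1, 20))  # truncate, then search
via_anchor <- grepl("^.{0,15}[VG]{5,}", strings)         # search with a bounded prefix

# Both approaches should flag exactly the same strings
print(identical(via_substr, via_anchor))
```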
