Detect a consecutive run of certain letters in a string using R

I would like to determine whether each string in the string column of the data frame below contains a run of at least 5 consecutive "V" or "G" letters within its first 20 characters.

Sample data:

 data = data.frame(
   class = c('a', 'b', 'C'),
   string = c("ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ",
              "AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD",
              "GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER")
 )

For example, the string in the first row has "VVVVG" within its first 20 characters. Similarly, the string in the third row has "VVGGV".

 data
 #  class string
 #1 a     ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ
 #2 b     AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD
 #3 C     GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER

The desired result should look like this:

 #   class string                                                  result
 # 1 a     ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ TRUE
 # 2 b     AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD      FALSE
 # 3 C     GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER       TRUE
2 answers

Along the lines of Akrun's suggestion:

 transform(data, result=grepl("[VG]{5,}", substr(string, 1, 20))) 

This produces:

   class string                                                  result
 1 a     ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ TRUE
 2 b     AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD      FALSE
 3 C     GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER       TRUE

Here we use grepl with a character class that matches either "G" or "V" ( [VG] ), repeated 5 or more times ( {5,} ). substr(string, 1, 20) restricts the search to the first 20 characters of each string, and transform simply creates a new data frame with added or modified columns.
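To make the role of each piece concrete, here is a small sketch on made-up inputs (the vector x is hypothetical and not part of the question). It compares searching the whole string with searching only the first 20 characters:

```r
# Toy inputs (hypothetical, chosen only to illustrate the pattern)
x <- c(
  "AAVVGGVAA",                       # run of 5 V/G at positions 3-7
  "VGVGAVGVG",                       # longest V/G run is only 4
  paste0(strrep("A", 18), "VVVVV")   # run of 5 starts at position 19
)

# Whole-string search: any run of 5 or more V/G counts
whole <- grepl("[VG]{5,}", x)

# Restricted search: only the first 20 characters are examined,
# so a run that starts at position 19 is cut off after 2 letters
first20 <- grepl("[VG]{5,}", substr(x, 1, 20))

print(data.frame(x, whole, first20))
```

The third element is the interesting case: it matches when the whole string is searched, but not once substr has truncated it to 20 characters.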


EDIT: some benchmarks versus Matthew's creative answer:

 set.seed(1)
 string <- vapply(
   replicate(1e5, sample(c("V", "G", "A", "S"), sample(20:300, 1), replace = TRUE)),
   paste0, character(1L), collapse = ""
 )
 library(microbenchmark)
 microbenchmark(
   grepl("[VG]{5,}", substr(string, 1, 20)),
   grepl("^.{,15}[VG]{5,}", string),
   times = 10
 )

It produces:

 Unit: milliseconds
                                      expr      min       lq     mean
  grepl("[VG]{5,}", substr(string, 1, 20)) 131.6668 131.8343 133.6644
          grepl("^.{,15}[VG]{5,}", string) 299.7326 300.4416 302.5065

I wasn't quite sure what to expect, but I think this makes sense, since substr is a very cheap operation. The timings are very close if the run of 5 repeats occurs near the front of the string.


Another option, without substr, is to anchor the pattern at the start of the string and allow at most 15 characters before the run:

 within(data, result<-grepl('^.{,15}[VG]{5,}', string)) 
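The reasoning behind the 15: a run of 5 that fits inside the first 20 characters must start at position 16 or earlier, so at most 15 characters can precede it. The sketch below checks on the question's sample strings that the anchored pattern and the substr version agree. It spells the quantifier as {0,15} with an explicit lower bound, which is assumed here to behave the same as the {,15} used in the answer under R's default regex engine; the variable names are illustrative only.

```r
# Sample strings copied from the question
strings <- c(
  "ASADSASAVVVVGVGGGSDASSSDDDFGDFGHFGHFGGGGGDDFFDDFGDFGTYJ",
  "AWEERTGVTHRGEFGDFSDFSGGGGGGDAWSDFAASDADAADWERWEQWD",
  "GRTVVGGVVVGGSWERGERVGEGDDFASDGGVQWEQWEQWERERYRYER"
)

# Approach 1: truncate to 20 characters, then look for a run of 5+
via_substr <- grepl("[VG]{5,}", substr(strings, 1, 20))

# Approach 2: anchor at the start and allow 0 to 15 characters
# before the run (explicit lower bound; an assumption, see above)
via_anchor <- grepl("^.{0,15}[VG]{5,}", strings)

# Both approaches should flag the same rows
print(identical(via_substr, via_anchor))
```

Which form to prefer is mostly a matter of taste; the benchmark in the other answer suggests the substr version is faster on long strings.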

Source: https://habr.com/ru/post/988522/

