Compare every nth character of a text string

I have received a large text file. Suppose its content is

a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg") 

I need to compare every third character of this text with a value (such as 'c') and, if it matches, add 1 to a counter i. I thought of using grep, but that function does not seem suited to this purpose, so I would appreciate your help or advice.

I also want to extract certain characters from this string into a vector, for example characters 4:10:

    a=c("atcgatcgatcgatcgatcgatcgatcgatcgatcg")
    [1] "gatcgat"

Thanks in advance.

PS

I know that R may not be the best tool for the script I need, but I'm curious whether it can be written in R.

+6
3 answers

Edited to provide a fast solution for much longer strings:

If you have a very long string (on the order of millions of nucleotides), the lookbehind assertion in my original answer (below) is too slow to be practical. In that case, use something more like the following: (1) split the string into individual characters; (2) use the characters to populate a three-row matrix; and then (3) extract the characters in the third row of the matrix. It takes about 0.2 seconds to process a string of 3 million characters.

    ## Make a 3-million-character string
    a <- paste0(sample(c("a", "t", "c", "g"), 3e6, replace=TRUE), collapse="")

    ## Extract the third nucleotide of each codon
    n3 <- matrix(strsplit(a, "")[[1]], nrow=3)[3,]

    ## Check that it works
    sum(n3=="c")
    # [1] 250431
    table(n3)
    # n3
    #      a      c      g      t
    # 250549 250431 249008 250012
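The matrix approach above relies on the string length being a multiple of 3 (otherwise `matrix()` warns about the data length). As a minimal sketch for arbitrary lengths, one could index the split characters with `seq()` instead. The helper name `every_nth_char` is hypothetical and not part of the original answer:

```r
# Hypothetical helper (not from the original answer): return every n-th
# character of a single string, ignoring any incomplete trailing group.
every_nth_char <- function(x, n = 3) {
  chars <- strsplit(x, "")[[1]]
  # seq() would fail if the string is shorter than n, so guard against it
  if (length(chars) < n) return(character(0))
  chars[seq(n, length(chars), by = n)]
}

a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"
n3 <- every_nth_char(a, 3)
sum(n3 == "c")
# [1] 3

# A length that is not a multiple of 3 is handled as well
every_nth_char("atcga", 3)
# [1] "c"
```

This keeps the same "split once, then index" idea, so it should scale similarly to the matrix version, although that has not been benchmarked here.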

Original answer:

I would use substr() in both cases.

    ## Split into codons. (The "lookbehind assertion", "(?<=.{3})", matches at each
    ## inter-character location that is preceded by three characters of any type.)
    codons <- strsplit(a, "(?<=.{3})", perl=TRUE)[[1]]
    # [1] "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg" "atc" "gat" "cga" "tcg"

    ## Extract 3rd nucleotide in each codon
    n3 <- sapply(codons, function(X) substr(X,3,3))
    # atc gat cga tcg atc gat cga tcg atc gat cga tcg
    # "c" "t" "a" "g" "c" "t" "a" "g" "c" "t" "a" "g"

    ## Count the number of 'c's
    sum(n3=="c")
    # [1] 3

    ## Extract nucleotides 4-10
    substr(a, 4,10)
    # [1] "gatcgat"
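As a possible variation on the substr() idea, base R's vectorized `substring()` can pull out every third character directly, without splitting into codons first. This is only a sketch and was not part of the original answer; the variable name `pos` is an illustrative choice:

```r
a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"

# Positions 3, 6, 9, ... up to the length of the string
pos <- seq(3, nchar(a), by = 3)

# substring() is vectorized over its start and stop arguments,
# so this returns one single-character string per position
n3 <- substring(a, pos, pos)

# Count the number of 'c's
sum(n3 == "c")
# [1] 3
```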
+7

This is a simple approach using R primitives:

    sum("c"==(strsplit(a,NULL))[[1]][c(FALSE,FALSE,TRUE)])
    # [1] 3   (this is the right answer)

The logical pattern c(FALSE,FALSE,TRUE) is recycled to the length of the split string and then used to index it. It can be adapted to select a different position or to use a longer period (for those who have extended codons).

Probably not efficient enough for whole genomes, but ideal for everyday use.
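To illustrate how the pattern could be adapted to a different period, here is a minimal sketch that builds the logical mask explicitly with `rep()`. The function name `count_every_nth` and its arguments are hypothetical, chosen only for this example:

```r
# Hypothetical helper: count how often `target` occurs at every n-th position
count_every_nth <- function(x, target, n = 3) {
  chars <- strsplit(x, NULL)[[1]]
  # Build (FALSE, ..., FALSE, TRUE) of length n and repeat it over the string.
  # R would recycle a short logical index automatically; rep() just makes
  # that step visible.
  mask <- rep(c(rep(FALSE, n - 1), TRUE), length.out = length(chars))
  sum(chars[mask] == target)
}

a <- "atcgatcgatcgatcgatcgatcgatcgatcgatcg"
count_every_nth(a, "c", 3)
# [1] 3
count_every_nth(a, "g", 4)
# [1] 9
```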

+3

Test whether every third character is "c":

    grepl("^(.{2}c)*.{0,2}$", a)
    # [1] FALSE
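Note that this pattern answers a yes/no question (is every third character a "c"?) rather than counting matches. A short sketch with made-up test strings shows when it returns TRUE; the wrapper name `every_third_is_c` is hypothetical:

```r
# Hypothetical wrapper around the pattern from the answer above
every_third_is_c <- function(x) grepl("^(.{2}c)*.{0,2}$", x)

every_third_is_c("aacttcggc")  # positions 3, 6 and 9 are all "c"
# [1] TRUE
every_third_is_c("aacttg")     # position 6 is "g"
# [1] FALSE
every_third_is_c("aactt")      # trailing incomplete triplet is allowed
# [1] TRUE
```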

Extract characters 4 through 10:

    substr(a, 4, 10)
    # [1] "gatcgat"
+1

Source: https://habr.com/ru/post/981077/