Matching and counting strings (k-mer DNA) in R

Question

Matching and counting strings (k-mer DNA) in R

I have a list of strings (DNA sequence), including A, T, C, G. I want to find all matches and paste into a table whose columns are a possible combination of these DNA alphabets (4 ^ k; "k" - the length of each match - K-mer - and must be specified by the user), and the lines are a number that matches the list in sequence.

Let's say my list includes 5 members:

DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")

I want to set k=2 (2-mer), so a combination of 4^2=16 , including AA,AT,AC,AG,TA,TT,...

So my table will have 5 rows and 16 columns . I want to count the number of matches between my k-mers and list members.

My desired result: df:

 lstMemb AA AT AC AG TA TT TC ... 1 2 1 1 0 0 3 0 2 ... 3 4 5

Could you help me implement this in R?

+6

string regex r dna-sequence

Cina Oct 28 '14 at 4:07

source share

5 answers

Maybe it helps

  source("http://bioconductor.org/biocLite.R") biocLite("Biostrings") library(Biostrings) t(sapply(DNAlst, function(x){x1 <- DNAString(x) oligonucleotideFrequency(x1,2)})) # AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT #[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3 #[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0 #[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1 #[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 #[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0

Or, as @Arun suggested, first convert list to vector

  oligonucleotideFrequency(DNAStringSet(unlist(DNAlst)), 2L) # AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT #[1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3 #[2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0 #[3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1 #[4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 #[5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0

+6

akrun Oct 28 '14 at 4:24

source share

My answer was not as quick as @bartektartanus. However, it is also pretty fast, and I wrote the code ...: D

The positive side of my code compared to others:

No need to set the failed version of stri_count_fixed
The stringi package will stringi be very slow for large k-mers, since it should generate all possible combinations for the template and subsequently check their existence in the data and calculate how many times it appears.
It also works for long single and multiple sequences with the same output very quickly.
Instead of creating a template string, you can put the value of k .
If you run oligonucleotideFrequency with k greater than 12 in a large sequence, the function freezes for excess memory usage and R restarts, and my function works pretty fast.

My code

 sequence_kmers <- function(sequence, k){ k_mers <- lapply(sequence,function(x){ seq_loop_size <- length(DNAString(x))-k+1 kmers <- sapply(1:seq_loop_size, function(z){ y <- z + k -1 kmer <- substr(x=x, start=z, stop=y) return(kmer) }) return(kmers) }) uniq <- unique(unlist(k_mers)) ind <- t(sapply(k_mers, function(x){ tabulate(match(x, uniq), length(uniq)) })) colnames(ind) <- uniq return(ind) }

I use the Biostrings package only for counting databases ... you can use other parameters, such as stringi for counting ... if you delete all the code below k_mers lapply and return(k_mers) , it only returns a list ... of all k-mers with corresponding repeating vectors

`sequence` Here is the sequence 1000bp

 #same output for 1 or multiple sequences > sequence_kmers(sequence,4)[,1:10] GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG 4 4 3 4 4 8 6 4 5 5 > sequence_kmers(c(sequence,sequence),4)[,1:10] GTCT TCTG CTGA TGAA GAAC AACG ACGC CGCG GCGA CGAG [1,] 4 4 3 4 4 8 6 4 5 5 [2,] 4 4 3 4 4 8 6 4 5 5

Tests performed using my function:

 #super fast for 1 sequence > system.time({sequence_kmers(sequence,13)}) usuário sistema decorrido 0.08 0.00 0.08 #works fast for 1 sequence or 50 sequences of 1000bps > system.time({sequence_kmers(rep(sequence,50),4)}) user system elapsed 3.61 0.00 3.61 #same speed for 3-mers or 13-mers > system.time({sequence_kmers(rep(sequence,50),13)}) user system elapsed 3.63 0.00 3.62

Tests performed using Biostrings :

 #Slow 1 sequence 12-mers > system.time({oligonucleotideFrequency(DNAString(sequence),12)}) user system elapsed 150.11 1.14 151.37 #Biostrings package freezes for a single sequence of 13-mers > system.time({oligonucleotideFrequency(sequence,13)}) freezes, used all my 8gb RAM

+4

Rodrigo de alexandre Nov 27 '14 at 21:15

source share

We recently released the 'kebabs' package as part of Bioconductor 3.0. Although this package aims to provide sequence kernels for classification, regression, and other tasks, such as clustering similarities, the package also includes functions for efficiently calculating k-mer frequencies:

 #installing kebabs: #source("http://bioconductor.org/biocLite.R") #biocLite(c("kebabs", "Biostrings")) library(kebabs) s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG") s1 s2 <- DNAString(paste0(rep(s1, 200), collate="")) s2 sk13 <- spectrumKernel(k=13, normalized=FALSE) system.time(kmerFreq <- drop(getExRep(s1, sk13))) kmerFreq system.time(kmerFreq <- drop(getExRep(s2, sk13))) kmerFreq

So, you see that the k-mer frequencies are obtained as explicit vector functions of the standard (non-normalized) kernel of the spectrum with k = 13. This function is implemented in high-performance C ++ code that creates a tree of prefixes and takes into account only k-mers, which actually meet in sequence (at your request). You see that even with k = 13 and a sequence with tens of thousands of bases, the calculations only take a fraction of the second (19 ms on our 5-year-old Dell server). The above function also works for DNAStringSets, but in this case you must remove drop () to get the matrix from k-dimensional frequencies. The matrix is by default (the class is 'dgRMatrix'), but you can also ensure that the result is in a standard dense matrix format (however, still ignoring k-mers, which are not found at all in any of the sequences):

 sv <- c(DNAStringSet(s1), DNAStringSet(s2)) system.time(kmerFreq <- getExRep(sv, sk13)) kmerFreq system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE)) kmerFreq

How long can k-mers be, your system may depend. In our system, the limit seems to be k = 22 for DNA sequences. The same works for RNA and amino acid sequences. However, for the latter, the limits in terms of k are much lower, since the space of functions is obviously much larger for the same k.

 #for the kebabs documentation please see: browseVignettes("kebabs")

Hope this helps. If you have further questions, let me know.

Regards, Ulrich

+4

UBod Nov 28 '14 at 11:09

source share

Another way to do this:

 DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA","ACACACACACCA") len <- 4 stri_sub_fun <- function(x) table(stri_sub(x,1:(stri_length(x)-len+1),length = len)) sapply(DNAlst, stri_sub_fun) [[1]] AAAC AACT ACTG ATTT CAAA CTGA GATT TGAT TTTT 1 1 1 1 1 1 1 1 1 [[2]] AAAA AAAG AAAT AAGT AATA ACCG AGTA ATAC ATGA GAAA GATG GTAA TAAA TACC TGAA 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [[3]] ATGC ATTA TATG TTAT 1 1 1 1 [[4]] TGGA 1 [[5]] ATCA CATC CGCA CGCG GCAT GCGC TCAA 1 1 1 1 1 1 1 [[6]] ACAC ACCA CACA CACC 4 1 3 1

+2

user4415867 Jan 03 '15 at 20:36

source share

bartektartanus · Accepted Answer · 2014-10-28T15:45:53+0000

If you are looking for speed, the obvious solution is stringi . There is a function stri_count_fixed to count patterns. And now, check the code and benchmark!

 DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA") dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T")) result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE)) colnames(result) <- dna result AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT [1,] 2 1 0 1 1 0 0 1 1 0 0 0 0 0 1 3 [2,] 5 1 1 2 0 1 1 0 2 0 0 1 2 0 1 0 [3,] 0 0 0 2 0 0 0 0 0 1 0 0 1 0 1 1 [4,] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 [5,] 1 0 0 1 2 0 2 0 0 2 0 0 0 1 0 0 fstri <- function(x){ t(sapply(x, stri_count_fixed,dna,T)) } fbio <- function(x){ t(sapply(x, function(x){x1 <- DNAString(x); oligonucleotideFrequency(x1,2)})) } all(fstri(DNAlst)==fbio(DNAlst)) #results are the same [1] TRUE longDNA <- sample(DNAlst,100,T) microbenchmark(fstri(longDNA),fbio(longDNA)) Unit: microseconds expr min lq mean median uq max neval fstri(longDNA) 689.378 738.184 825.3014 766.862 793.134 6027.039 100 fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294 100 127245.489/766.862 ## [1] 165.9301

Ca is 165 times faster :)

Matching and counting strings (k-mer DNA) in R

The positive side of my code compared to others:

My code

sequence Here is the sequence 1000bp

More articles:

`sequence` Here is the sequence 1000bp