We recently released the 'kebabs' package as part of Bioconductor 3.0. Although this package is aimed at providing sequence kernels for classification, regression, and other tasks such as similarity-based clustering, it also includes functionality for computing k-mer frequencies efficiently:
#installing kebabs:
#source("http://bioconductor.org/biocLite.R")
#biocLite(c("kebabs", "Biostrings"))
library(kebabs)
s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
s1
s2 <- DNAString(paste0(rep(s1, 200), collapse=""))
s2
sk13 <- spectrumKernel(k=13, normalized=FALSE)
system.time(kmerFreq <- drop(getExRep(s1, sk13)))
kmerFreq
system.time(kmerFreq <- drop(getExRep(s2, sk13)))
kmerFreq
As you can see, the k-mer frequencies are obtained as the explicit feature vectors of the standard (unnormalized) spectrum kernel with k = 13. This function is implemented in highly efficient C++ code that builds a prefix tree and only considers k-mers that actually occur in the sequence (as you requested). Even for k = 13 and a sequence of 10,000 bases, the computation takes only a fraction of a second (19 ms on our 5-year-old Dell server). The above function also works for DNAStringSets, but in that case you should remove drop() to get a matrix of k-mer frequencies. The matrix is sparse by default (class 'dgRMatrix'), but you can also request the result in standard dense matrix format (still omitting k-mers that do not occur in any of the sequences):
sv <- c(DNAStringSet(s1), DNAStringSet(s2))
system.time(kmerFreq <- getExRep(sv, sk13))
kmerFreq
system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
kmerFreq
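To make concrete what such a frequency vector contains, here is a minimal base-R sketch that counts only the k-mers actually occurring in one sequence. It is not how kebabs does it (kebabs uses C++ and a prefix tree), and it will be far slower on long sequences; count_kmers is a hypothetical helper written just for this illustration.

```r
# Hypothetical helper (not part of kebabs): count the k-mers that occur in a
# single sequence given as a plain character string.
count_kmers <- function(seq, k) {
  n <- nchar(seq)
  # A sequence shorter than k contains no k-mers at all.
  if (n < k) return(table(character(0)))
  # Start positions of all overlapping windows of width k.
  starts <- seq_len(n - k + 1)
  kmers <- substring(seq, starts, starts + k - 1)
  # table() only lists k-mers that were observed, so absent k-mers
  # never take up space in the result.
  table(kmers)
}

freq <- count_kmers("ATCGATCG", k = 3)
freq
```

For a DNAString such as s1 above, something like count_kmers(as.character(s1), 13) should give the same kind of counts as the explicit representation, though the ordering and the container class will differ.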
How large k can be may depend on your system. On our system, the limit seems to be k = 22 for DNA sequences. The same works for RNA and amino acid sequences. For the latter, however, the limits in terms of k are much lower, since the feature space is obviously much larger for the same k.
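A quick back-of-the-envelope calculation shows why the limit on k drops so sharply for amino acids: the number of possible k-mers is the alphabet size raised to the power k. The sketch below assumes a 4-letter DNA alphabet and the 20 standard amino acids; feature_space_size is a hypothetical helper, and the alphabets kebabs actually accepts may include further symbols.

```r
# Number of distinct k-mers over an alphabet of the given size.
feature_space_size <- function(alphabet_size, k) alphabet_size^k

k <- c(5, 10, 13, 22)
sizes <- data.frame(
  k = k,
  dna = feature_space_size(4, k),       # assumes the 4 letters A, C, G, T
  protein = feature_space_size(20, k)   # assumes 20 standard amino acids
)
sizes
```

Under these assumptions, the amino acid feature space already overtakes the DNA feature space at k = 22 once k reaches about 11.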
#for the kebabs documentation please see:
browseVignettes("kebabs")
Hope this helps. If you have further questions, let me know.
Regards, Ulrich