Is there a faster way to subset a sparse matrix than "["?

I support the seqMeta package and have been looking for ideas on how to speed up its main bottleneck: splitting a large matrix into smaller pieces many times.

Background

The seqMeta package is used to analyze genetic data. You have a set of subjects (n_subjects) and a series of genetic markers (n_snps). This leads to an n_subjects x n_snps matrix, Z. There is also a data frame that tells you which snps are grouped together for analysis (usually the snps that fall within a given gene).

While Z can be large, it is quite sparse. Usually less than 10% of the values are non-zero, and sometimes only about 2%. Storing Z as a sparse matrix seems like an obvious choice to save space.
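
For instance, a minimal sketch of the memory savings (sizes are approximate and will vary by R/Matrix version):

library(Matrix)
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[sample(length(dense), 1e4)] <- 1
sparse <- Matrix(dense, sparse = TRUE)
object.size(dense)   # ~8 MB: all 1e6 doubles are stored
object.size(sparse)  # ~125 KB: only the 1e4 non-zero entries are stored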

Current project: n_subjects ~ 15,000 and n_snps ~ 2 million, with over 200,000 splits.

As the size of the data continues to grow, I find that the limiting factor on runtime is the number of groupings, not the actual size of the data. (See the example below: runtime is a linear function of n_splits for the same data.)

Simplified example

library(Matrix)

set.seed(1)

n_subjects <- 1e3
n_snps <- 1e5
sparsity <- 0.05

n <- floor(n_subjects * n_snps * sparsity)

# create our simulated data matrix
Z <- Matrix(0, nrow = n_subjects, ncol = n_snps, sparse = TRUE)
pos <- sample(n_subjects * n_snps, size = n, replace = FALSE)
vals <- rnorm(n)
Z[pos] <- vals
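
# note: building Z in one shot with sparseMatrix() is typically much
# faster than assigning into a sparse zero matrix; an equivalent
# construction (shown for reference only):
#   ij <- arrayInd(pos, c(n_subjects, n_snps))
#   Z  <- sparseMatrix(i = ij[, 1], j = ij[, 2], x = vals,
#                      dims = c(n_subjects, n_snps))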

# create the data frame describing how to split
# in the real data set the group sizes are between 1 and ~1500
n_splits <- 500
sizes <- sample(2:20, size = n_splits, replace = TRUE)  
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))

# simple function that gets called on the split
# the real function creates a cols x cols dense upper triangular matrix
# similar to a covariance matrix
simple_fun <- function(Z, cols) {sum(Z[ , cols])}
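
# for context, a hypothetical stand-in closer to the real workload
# described above (not the actual seqMeta code): a dense cols x cols
# matrix with only the upper triangle kept, like a covariance matrix
cov_like_fun <- function(Z, cols) {
  m <- as.matrix(crossprod(Z[, cols]))  # cols x cols, dense
  m[lower.tri(m)] <- 0                  # keep the upper triangle
  m
}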

# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)

##    user  system elapsed 
##    3.21    0.00    3.21  

n_splits <- 1000
sizes <- sample(2:20, size = n_splits, replace = TRUE)  
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))

# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)

##    user  system elapsed 
##    6.38    0.00    6.38

n_splits <- 5000
sizes <- sample(2:20, size = n_splits, replace = TRUE)  
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))

# split our matrix based on the look-up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)

##    user  system elapsed 
##   31.65    0.00   31.66

Question: Is there a faster way to subset a matrix than '['? Or are there other approaches I am missing?
