How to find indices of nonzero elements in a large sparse matrix?

Question

How to find indices of nonzero elements in a large sparse matrix?

i has two square matrices (a, b) about 100000 x 100000 in size. I have to distinguish between these two matrices (c = ab). The resulting matrix 'c' is a sparse matrix. I want to find the indices of all nonzero elements. I have to perform this operation many times (> 100).

The simplest way is to use two cycles. But it is computationally intensive. Can you tell me any algorithm or package / library, preferably in R / python / c, to do this as quickly as possible?

+6

python algorithm r sparse-matrix

Netro Sep 09 '11 at 12:12

source share

6 answers

In R, if you use the Matrix and sparseMatrix to convert from a coordinate list to a sparse matrix, then you can convert back to column 3:

 TmpX <- as(M, "dgTMatrix") X3col <- matrix(c( TmpX@i , TmpX@j , TmpX@val ), ncol = 3)

This will give you the coordinates and values in a sparse matrix.

Depending on the location of non-zero entries in A and B, you may find it is much better to work with a coordinate list than with a sparse matrix representation (by the way, there are dozens of sparse matrix representations), since you can use the direct use of vectorized operations, rather than relying on Your sparse matrix pack for optimal performance. I tend to alternate between using COO or a sparse matrix in different languages, depending on how I get the fastest performance for the algorithm of interest.

Update 1: I did not know that your two matrices A and B are dense. As such, the easiest solution to finding non-zero entries in C is simply not to subtract first - just compare entries A and B. Logical comparison should be faster than subtraction. First find the entries A and B, where A != B , and then subtract only those entries. Then you just need to convert from vectorization of the indices in and B to their (row, col) representation. This is similar to ind2sub and sub2ind matlab - look at this R link for calculations.

+3

Iterator Sep 09 '11 at 12:54

source share

You can use c.nonzero() method:

 >>> from scipy.sparse import lil_eye >>> c = lil_eye((4, 10)) # as an example >>> c <4x10 sparse matrix of type '<type 'numpy.float64'>' with 4 stored elements in LInked List format> >>> c.nonzero() (array([0, 1, 2, 3], dtype=int32), array([0, 1, 2, 3], dtype=int32)) >>> import numpy as np >>> np.ascontiguousarray(c) array([ (0, 0) 1.0 (1, 1) 1.0 (2, 2) 1.0 (3, 3) 1.0], dtype=object)

You do not need to calculate the matrix c to determine the indices of nonzero elements in c = a - b ; you can do (a != b).nonzero() :

 >>> a = np.random.random_integers(2, size=(4,4)) >>> b = np.random.random_integers(2, size=(4,4)) >>> (a != b).nonzero() (array([0, 0, 1, 1, 1, 2, 3]), array([1, 2, 1, 2, 3, 2, 0])) >>> a - b array([[ 0, 1, 1, 0], [ 0, 1, -1, -1], [ 0, 0, 1, 0], [-1, 0, 0, 0]])

+2

jfs Sep 09 '11 at 12:25

source share

look numpy it has everything you ask for and much more!

See this for sparse matrix support.

+1

Fredrik pihl Sep 09 '11 at 12:15

source share

This code takes less than 0.1 s.

 m <- matrix(rpois(1000000,0.01),ncol=1000) m0 <- lapply(seq(NCOL(m)),function(x) which(m[,x] != 0))

EDIT: For sparse matrices of any size (which is suitable for memory).

DATA

 library(data.table) N <- 1e+5 n <- 1e+6 ta <- data.table(r=sample(seq(N), n,replace=TRUE), c=sample(seq(N), n,replace=TRUE), a=sample(1:20,n,replace=TRUE)) tb <- data.table(r=sample(seq(N), n,replace=TRUE), c=sample(seq(N), n,replace=TRUE), b=sample(1:20,n,replace=TRUE)) setkey(ta,r,c) setkey(tb,r,c)

THE CODE

 system.time(tw <- ta[tb][is.na(a)|is.na(b)|(ab != 0),list(r=r,c=c)])

+1

Wojciech sobala Sep 09 '11 at 13:27

source share

I did not time it, but the simplest code is

 all.indices<- which (C>0, arr.ind=T)

+1

Carl Witthoft Sep 09 '11 at 15:20

source share

David heffernan · Accepted Answer · 2011-09-09T13:21:22+0000

Since you have two dense matrices, a double for loop is the only option you have. You do not need a sparse matrix class at all, since you only want to know the list of indices (i,j) for which a[i,j] != b[i,j] .

In languages like R and Python, a double loop will not work well. I would probably write this in my own code for the double for loop and add indexes to the list object. But, no doubt, the wizards of the interpreted code (i.e. R, Python, etc.) know effective ways to do this without resorting to native encoding.

How to find indices of nonzero elements in a large sparse matrix?

More articles: