Mean of the nonzero values in a sparse matrix?

I am trying to calculate the mean of the nonzero values in each row of a sparse row (CSR) matrix. The matrix's own mean method does not do this, because it also counts the zeros:

    >>> from scipy.sparse import csr_matrix
    >>> a = csr_matrix([[0, 0, 2], [1, 3, 8]])
    >>> a.mean(axis=1)
    matrix([[ 0.66666667],
            [ 4.        ]])

The following works, but is slow for large matrices:

    >>> import numpy as np
    >>> b = np.zeros(a.shape[0])
    >>> for i in range(a.shape[0]):
    ...     b[i] = a.getrow(i).data.mean()
    ...
    >>> b
    array([ 2.,  4.])

Can someone tell me if there is a faster method?

3 answers

This is a typical problem where you can use numpy.bincount. For this, I used three functions:

 (x,y,z)=scipy.sparse.find(a) 

returns the row indices (x), column indices (y), and values (z) of the nonzero entries of a sparse matrix. For instance, x here is array([0, 1, 1, 1]).

numpy.bincount(x) returns, for each row index, the number of nonzero elements in that row.

numpy.bincount(x, weights=z) returns, for each row, the sum of its nonzero elements.

Final working code:

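    import numpy
    import scipy.sparse
    from scipy.sparse import csr_matrix

    a = csr_matrix([[0, 0, 2], [1, 3, 8]])

    # Row indices, column indices, and values of the nonzero entries
    (x, y, z) = scipy.sparse.find(a)
    countings = numpy.bincount(x)        # number of nonzero entries per row
    sums = numpy.bincount(x, weights=z)  # sum of the nonzero entries per row
    averages = sums / countings
    print(averages)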

returns:

 [ 2. 4.] 
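
One caveat with the bincount approach: if some rows contain no nonzero entries, their counts are zero, and if the last rows are empty the bincount result is shorter than the number of rows. The following is only a sketch of one way to handle that, assuming NaN is an acceptable result for empty rows; the helper name row_nonzero_means_bincount is made up for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix, find


def row_nonzero_means_bincount(a):
    """Mean of the nonzero entries in each row; NaN for rows without any.

    Hypothetical helper for illustration, not part of SciPy.
    """
    rows, _, vals = find(a)  # row indices, column indices, nonzero values
    n_rows = a.shape[0]
    # minlength keeps one bin per row even when trailing rows are empty
    counts = np.bincount(rows, minlength=n_rows)
    sums = np.bincount(rows, weights=vals, minlength=n_rows)
    means = np.full(n_rows, np.nan)
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return means


# Same matrix as in the question, plus an all-zero third row
a = csr_matrix([[0, 0, 2], [1, 3, 8], [0, 0, 0]])
means = row_nonzero_means_bincount(a)
print(means)
```

Whether NaN, 0, or an error is the right answer for an empty row depends on the application.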

With a CSR-format matrix, you can do this even more easily:

    sums = a.sum(axis=1).A1
    counts = np.diff(a.indptr)
    averages = sums / counts

Row sums are directly supported, and the structure of the CSR format means that the difference between consecutive values in the indptr array corresponds exactly to the number of nonzero elements in each row.
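
To make the indptr argument concrete, here is a small sketch showing what indptr holds for the example matrix. It also illustrates an assumption behind the indptr approach: np.diff(a.indptr) counts stored entries, so a matrix that carries explicitly stored zeros would need eliminate_zeros() first. The explicit zero below is created artificially by writing into .data, purely for demonstration.

```python
import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix([[0, 0, 2], [1, 3, 8]])

# Row i owns the slice a.data[a.indptr[i]:a.indptr[i + 1]]
indptr = a.indptr.copy()      # [0, 1, 4] for this matrix
counts = np.diff(a.indptr)    # [1, 3] stored entries per row

# Artificially turn the stored value at position (1, 0) into an explicit zero
b = a.astype(float)
b.data[1] = 0.0
counts_with_explicit_zero = np.diff(b.indptr)  # still [1, 3]

# After dropping explicit zeros, the counts reflect true nonzeros
b.eliminate_zeros()
counts_clean = np.diff(b.indptr)               # [1, 2]
sums = np.asarray(b.sum(axis=1)).ravel()
averages = sums / counts_clean
print(averages)
```

Matrices built directly from dense data, as in the question, normally contain no explicit zeros, so this extra step matters mainly after in-place arithmetic on .data.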


I always like to sum the values along whichever axis you are interested in and divide by the number of nonzero elements in the corresponding row or column.

Like so:

    >>> from scipy.sparse import csr_matrix
    >>> sp_arr = csr_matrix([[0, 0, 2], [1, 3, 8]])
    >>> col_avg = sp_arr.sum(0) / (sp_arr != 0).sum(0)
    >>> row_avg = sp_arr.sum(1) / (sp_arr != 0).sum(1)
    >>> col_avg
    matrix([[ 1.,  3.,  5.]])
    >>> row_avg
    matrix([[ 2.],
            [ 4.]])

Basically, you sum all entries along the given axis and divide by the number of True entries in the mask matrix != 0 (which is the number of actual nonzero entries).

I find this approach less complicated and easier to follow than the other options.
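
As a quick sanity check of the mask-based approach, the sketch below compares it with a plain dense computation on a random test matrix. The matrix size, the seed, and the trick of forcing the first column to 1 (so that no row is empty and no division by zero occurs) are arbitrary choices made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = rng.integers(0, 4, size=(50, 20))  # values 0..3, so roughly a quarter are zeros
dense[:, 0] = 1                            # guarantee at least one nonzero per row
sp_arr = csr_matrix(dense)

# Boolean sparse mask of the nonzero entries; summing it counts them per row
nonzero_mask = sp_arr != 0
counts = np.asarray(nonzero_mask.sum(axis=1)).ravel()
sums = np.asarray(sp_arr.sum(axis=1)).ravel()
row_avg = sums / counts

# Dense reference: mean over the nonzero values of each row
expected = np.array([row[row != 0].mean() for row in dense])
matches = np.allclose(row_avg, expected)
print(matches)
```

Wrapping the sums in np.asarray(...).ravel() yields flat 1-D arrays instead of the np.matrix objects shown in the answer.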


Source: https://habr.com/ru/post/1238109/