Row-wise division of a massive CSR matrix

I want to iterate over the rows of a CSR matrix and divide each element by its row's sum, like this:

numpy divide row by row sum

My problem is that I am dealing with a large matrix: (96582, 350138)

Applying the operation from that related question blows up my memory, since the returned matrix is dense.
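
For concreteness, a minimal sketch of the memory problem (assuming counts is a scipy.sparse.csr_matrix; the variable names are illustrative):

    # Dividing a sparse matrix by a dense column of row sums the numpy way
    # densifies the result: a (96582, 350138) float64 matrix would need
    # roughly 96582 * 350138 * 8 bytes, i.e. around 270 GB.
    dense = counts / counts.sum(axis=1)  # result is a dense numpy matrix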

So here is my first attempt:

    for row in counts:
        row = row / row.sum()

Unfortunately, this does not affect the matrix at all, so I came up with a second idea: build a new CSR matrix and concatenate the rows using vstack:

    from scipy import sparse
    import time

    start_time = curr_time = time.time()
    mtx = sparse.csr_matrix((0, counts.shape[1]))
    for i, row in enumerate(counts):
        prob_row = row / row.sum()
        mtx = sparse.vstack([mtx, prob_row])
        if i % 1000 == 0:
            delta_time = time.time() - curr_time
            total_time = time.time() - start_time
            curr_time = time.time()
            print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))

This works well, but after some iterations, it gets slower and slower:

    step: 0, total time: 0, delta_time: 0
    step: 1000, total time: 1, delta_time: 1
    step: 2000, total time: 5, delta_time: 4
    step: 3000, total time: 12, delta_time: 6
    step: 4000, total time: 23, delta_time: 11
    step: 5000, total time: 38, delta_time: 14
    step: 6000, total time: 55, delta_time: 17
    step: 7000, total time: 88, delta_time: 32
    step: 8000, total time: 136, delta_time: 47
    step: 9000, total time: 190, delta_time: 53
    step: 10000, total time: 250, delta_time: 59
    step: 11000, total time: 315, delta_time: 65
    step: 12000, total time: 386, delta_time: 70
    step: 13000, total time: 462, delta_time: 76
    step: 14000, total time: 543, delta_time: 81
    step: 15000, total time: 630, delta_time: 86
    step: 16000, total time: 722, delta_time: 92
    step: 17000, total time: 820, delta_time: 97

Any suggestions? Any idea why vstack is getting slower and slower?

2 answers

vstack is an O(n) operation because it needs to allocate memory for the result and then copy the contents of all the arrays you pass as arguments into that result. Calling it once per row inside a loop therefore makes the whole loop O(n²): every iteration copies everything accumulated so far, which is exactly the progressive slowdown you observe.
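
If you do want to keep a loop, a minimal sketch of the usual workaround (assuming every row has a nonzero sum) is to collect the normalized rows in a plain list and call vstack a single time at the end, so the allocate-and-copy cost is paid once instead of n times:

    from scipy import sparse

    # Cheap list appends inside the loop; one O(total size) vstack at the end.
    rows = [row / row.sum() for row in counts]  # each row stays sparse
    mtx = sparse.vstack(rows)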

You can simply use multiply to do the operation:

    >>> res = counts.multiply(1 / counts.sum(1))  # multiply with the inverse of the row sums
    >>> res.todense()
    matrix([[ 0.33333333,  0.        ,  0.66666667],
            [ 0.        ,  0.        ,  1.        ],
            [ 0.26666667,  0.33333333,  0.4       ]])
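
Note that multiply with a dense operand can return a COO matrix in some SciPy versions, so a .tocsr() may be needed afterwards. An equivalent sketch that stays sparse end to end scales the rows with a sparse diagonal matrix (again assuming no row sums to zero):

    import numpy as np
    from scipy import sparse

    inv_sums = 1.0 / np.ravel(counts.sum(axis=1))     # one inverse sum per row
    res = sparse.diags(inv_sums).dot(counts).tocsr()  # row-scale as D @ A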

But it is also quite easy to use np.lib.stride_tricks.as_strided to perform the operation you need (relatively) efficiently. This as_strided function also lets you perform more complicated operations on the array (in case there is no method or function for your use case).

For example, using the csr_matrix example from the scipy documentation:

    >>> from scipy.sparse import csr_matrix
    >>> import numpy as np
    >>> row = np.array([0, 0, 1, 2, 2, 2])
    >>> col = np.array([0, 2, 2, 0, 1, 2])
    >>> data = np.array([1., 2, 3, 4, 5, 6])
    >>> counts = csr_matrix((data, (row, col)), shape=(3, 3))
    >>> counts.todense()
    matrix([[ 1.,  0.,  2.],
            [ 0.,  0.,  3.],
            [ 4.,  5.,  6.]])

You can divide each row by the sum as follows:

    >>> row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
    ...                                                  shape=(counts.shape[0], 2),
    ...                                                  strides=2*counts.indptr.strides)
    >>> for start, stop in row_start_stop:
    ...     row = counts.data[start:stop]
    ...     row /= row.sum()
    ...
    >>> counts.todense()
    matrix([[ 0.33333333,  0.        ,  0.66666667],
            [ 0.        ,  0.        ,  1.        ],
            [ 0.26666667,  0.33333333,  0.4       ]])
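For what it's worth, a sketch of two equivalent alternatives that avoid stride tricks entirely: consecutive indptr entries already give each row's start and stop, and the division can even be fully vectorized (assuming float data and no zero sums among the non-empty rows):

    import numpy as np

    # Alternative 1: pair consecutive indptr entries to slice counts.data per row.
    for start, stop in zip(counts.indptr[:-1], counts.indptr[1:]):
        counts.data[start:stop] /= counts.data[start:stop].sum()

    # Alternative 2 (use one or the other, not both): fully vectorized,
    # repeating each row's sum across that row's nonzeros.
    row_sums = np.ravel(counts.sum(axis=1))
    counts.data /= np.repeat(row_sums, np.diff(counts.indptr))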

Edit

@MSeifert's answer is much more efficient, and that should be the right way to do it. I think the notation counts[i, :] implies column slicing, which I had not realized; the documentation clearly states that these are really inefficient operations on a csr_matrix. This is actually a good example of that.

Answer

The documentation says row slicing is efficient, so I think you should do:

    for i in range(counts.shape[0]):
        counts[i, :] /= counts[i, :].sum()

This way you edit your matrix in place, it stays sparse, and you do not need vstack. I'm not sure it is the most efficient operation, but at least you should not have memory problems, and there is no slowdown effect as the rows are processed:

    import time

    s = time.time()
    for i in range(counts.shape[0]):
        counts[i, :] /= (counts[i, :].sum() + 1)
        if i % 1000 == 0:
            e = time.time()
            if i > 0:
                print i, e - s  # Python 2 print statement
            s = time.time()

Timings:

    1000 6.00199794769
    2000 6.02894091606
    3000 7.44459486008
    4000 7.10011601448
    5000 6.16998195648
    6000 7.79510307312
    7000 7.00139117241
    8000 7.37821507454
    9000 7.28075814247
    ...

Timings for MSeifert's answer:

    row_start_stop = np.lib.stride_tricks.as_strided(counts.indptr,
                                                     shape=(counts.shape[0], 2),
                                                     strides=2*counts.indptr.strides)
    s = time.time()
    for i, (start, stop) in enumerate(row_start_stop):
        row = counts.data[start:stop]
        row /= row.sum()
        if i % 1000 == 0:
            e = time.time()
            if i > 0:
                print i, e - s
            s = time.time()

    1000 0.00735783576965
    2000 0.0108380317688
    3000 0.0102109909058
    4000 0.0131571292877
    5000 0.00670218467712
    6000 0.00608897209167
    7000 0.00663685798645
    8000 0.0164499282837
    9000 0.0061981678009
    ...

As for why vstack is slow, @MSeifert's answer explains it well.

