I want to iterate over the rows of a CSR matrix and divide each element by the row's sum, similar to this question:

numpy divide row by row sum
My problem is that I am dealing with a large matrix, of shape (96582, 350138), and applying the operation from that related question blows up my memory, since the returned matrix is dense.
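The dense blow-up can be reproduced at toy scale. This is a minimal sketch, assuming the operation in question is dividing by `counts.sum(axis=1)`; `counts` here is a small made-up stand-in for my matrix:

```python
import numpy as np
from scipy import sparse

# small stand-in for the real `counts` matrix
counts = sparse.csr_matrix(np.array([[1., 3., 0.],
                                     [0., 2., 2.]]))

# dividing a sparse matrix by its (dense) row sums returns a dense result
probs = counts / counts.sum(axis=1)
print(sparse.issparse(probs))  # False
```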
So here is my first attempt:
for row in counts:
    row = row / row.sum()
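I can verify at toy scale that this loop leaves the matrix untouched (again with a small made-up stand-in for `counts`):

```python
import numpy as np
from scipy import sparse

# small stand-in for the real `counts` matrix
counts = sparse.csr_matrix(np.array([[1., 3.], [2., 2.]]))
before = counts.toarray()

for row in counts:
    # `row` is a new 1-row sparse matrix; the assignment rebinds the
    # local name `row` and never writes back into `counts`
    row = row / row.sum()

assert np.array_equal(counts.toarray(), before)  # matrix is unchanged
```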
Unfortunately, this does not affect the matrix at all, so I came up with a second idea: create a new CSR matrix and concatenate the rows using vstack:
from scipy import sparse
import time

start_time = curr_time = time.time()
mtx = sparse.csr_matrix((0, counts.shape[1]))
for i, row in enumerate(counts):
    prob_row = row / row.sum()
    mtx = sparse.vstack([mtx, prob_row])
    if i % 1000 == 0:
        delta_time = time.time() - curr_time
        total_time = time.time() - start_time
        curr_time = time.time()
        print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))
This works well, but after some iterations, it gets slower and slower:
step: 0, total time: 0, delta_time: 0
step: 1000, total time: 1, delta_time: 1
step: 2000, total time: 5, delta_time: 4
step: 3000, total time: 12, delta_time: 6
step: 4000, total time: 23, delta_time: 11
step: 5000, total time: 38, delta_time: 14
step: 6000, total time: 55, delta_time: 17
step: 7000, total time: 88, delta_time: 32
step: 8000, total time: 136, delta_time: 47
step: 9000, total time: 190, delta_time: 53
step: 10000, total time: 250, delta_time: 59
step: 11000, total time: 315, delta_time: 65
step: 12000, total time: 386, delta_time: 70
step: 13000, total time: 462, delta_time: 76
step: 14000, total time: 543, delta_time: 81
step: 15000, total time: 630, delta_time: 86
step: 16000, total time: 722, delta_time: 92
step: 17000, total time: 820, delta_time: 97
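A toy reproduction with made-up sizes shows the same super-linear growth in delta_time, even without the division step:

```python
import time
import numpy as np
from scipy import sparse

# made-up small sizes: repeatedly vstack a single row onto a growing matrix
row = sparse.csr_matrix(np.ones((1, 100)))
mtx = sparse.csr_matrix((0, 100))

curr = time.time()
for i in range(2000):
    # vstack builds a brand-new matrix, copying every row accumulated so far
    mtx = sparse.vstack([mtx, row])
    if i % 500 == 499:
        print('rows: %i, delta_time: %.3f' % (mtx.shape[0], time.time() - curr))
        curr = time.time()
```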
Any suggestions? Any idea why vstack is getting slower and slower?