I have a large csr_matrix, and I want to sum groups of its rows to get a new csr_matrix with the same number of columns but fewer rows. (Context: the matrix is a document-term matrix obtained from sklearn's CountVectorizer, and I want to be able to quickly combine documents according to the codes associated with them.)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
Now, let's say I need a new matrix B in which rows (1, 4) and rows (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
And it should again be in a sparse format (because the real data I'm working with is large). I tried summing slices of the matrix and then stacking the results:
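# Row indices are 0-based, so rows (1, 4) and (2, 3, 5) become:
idx1 = [0, 3]
idx2 = [1, 2, 4]
# Sum down the rows (axis=0) so each result keeps all 5 columns;
# .sum() returns a dense matrix, so convert back to sparse before stacking
A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))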
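One way this kind of grouped row sum is often done without slicing is to multiply A by a sparse indicator matrix, with one row per group and a 1 wherever a row of A belongs to that group. The following is only a sketch under some assumptions: the helper name sum_rows_by_group and the labels array (one code per row of A, 0-based row order) are hypothetical and not part of the original setup.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sum_rows_by_group(A, labels):
    """Sum the rows of sparse matrix A that share a label (hypothetical helper)."""
    labels = np.asarray(labels)
    # groups: sorted unique labels; inverse: group index of each row of A
    groups, inverse = np.unique(labels, return_inverse=True)
    inverse = np.ravel(inverse)
    n_rows = A.shape[0]
    # Indicator matrix S of shape (n_groups, n_rows):
    # S[g, r] = 1 if row r of A belongs to group g
    S = csr_matrix(
        (np.ones(n_rows, dtype=A.dtype), (inverse, np.arange(n_rows))),
        shape=(len(groups), n_rows),
    )
    # The sparse product adds up the rows of each group and stays sparse
    return groups, (S @ A).tocsr()

# Same matrix as in the question
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Assumed codes: rows 1 and 4 (1-based) get code 0, rows 2, 3 and 5 get code 1
labels = np.array([0, 1, 1, 0, 1])
groups, B = sum_rows_by_group(A, labels)
print(B.toarray())
```

For the example matrix this should reproduce the B shown above. The result never passes through a dense array, and the whole grouping is a single sparse matrix product, so it does not need one slice per group.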