Sum rows in scipy.sparse.csr_matrix

I have a large csr_matrix, and I want to sum over groups of rows to get a new csr_matrix with the same number of columns but fewer rows. (Context: the matrix is a document-term matrix obtained from sklearn's CountVectorizer, and I want to be able to quickly combine documents according to the codes associated with them.)

For a minimal example, this is my matrix:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack

row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())

[[1 0 0 0 0]
 [0 0 3 0 0]
 [0 5 0 0 0]
 [4 0 0 0 0]
 [0 0 2 0 0]]

Now, let's say I need a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look like this:

[[5 0 0 0 0]
 [0 5 5 0 0]]

It should again be in a sparse format (because the real data I'm working with is big). I tried to sum slices of the matrix and then stack them:

idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))

However, this does not give the result I want.

Is there a better way to do this?

+2
2

The operation you want can be written as a matrix product with a selector matrix S. With dense arrays:

>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
       [0, 5, 5, 0, 0]])
>>>

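To keep everything sparse, build S as a csr_matrix as well. The row array gives, for each row of A, the output row it is added to: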

col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())

Output:

<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
 [0 5 5 0 0]]

To combine the rows differently, change the row array used to build S.
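As a minimal sketch of how the selector-matrix idea might generalize to any number of groups, the snippet below builds S from one label per row of A. The helper name `sum_rows_by_group` and the labels `"a"`/`"b"` are hypothetical, chosen only for illustration; the grouping itself relies on `np.unique(..., return_inverse=True)`.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sum_rows_by_group(A, labels):
    """Sum the rows of sparse matrix A that share a label (hypothetical helper)."""
    labels = np.asarray(labels)
    # groups: sorted unique labels; group_idx[i]: output row for input row i
    groups, group_idx = np.unique(labels, return_inverse=True)
    n_rows = A.shape[0]
    # S has a single 1 per column: S[g, i] == 1 when row i belongs to group g
    S = csr_matrix(
        (np.ones(n_rows, dtype=A.dtype), (group_idx, np.arange(n_rows))),
        shape=(len(groups), n_rows),
    )
    # sparse @ sparse stays sparse
    return groups, S @ A


# The example matrix from the question
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# One label per row of A: rows 1 and 4 form one group, rows 2, 3 and 5 the other
groups, B = sum_rows_by_group(A, ["a", "b", "b", "a", "b"])
print(groups)
print(B.toarray())
```

The returned `groups` array records which label each row of `B` corresponds to, which should be useful when the labels are document codes rather than consecutive integers.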

+4

Note that the indices must be zero-based:

idx1 = [0, 3]       # rows 1 and 4
idx2 = [1, 2, 4]    # rows 2,3 and 5

Then A_sub1 and A_sub2 should be summed along axis=0:

A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
       [0, 5, 5, 0, 0]])

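Note, however, that A[idx, :].sum(axis=0) returns a dense matrix, so the answer from @Mr_E, which stays sparse throughout, is probably the better choice.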

Alternatively, sum along axis=0 and use np.vstack (instead of scipy.sparse.vstack):

A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))

Output:

matrix([[5, 0, 0, 0, 0],
        [0, 5, 5, 0, 0]])
+1

Source: https://habr.com/ru/post/1686574/

