Effectively subtracts a vector from a matrix (Scipy)

I have a large matrix, which is stored as scipy.sparse.csc_matrix and subtracts the column vector from each column of the large matrix. This is a fairly common task when you do things like normalization / standardization, but I cannot find the right way to do this effectively.

Here is an example demonstrating:

# mat is a 3x3 matrix mat = scipy.sparse.csc_matrix([[1, 2, 3], [2, 3, 4], [3, 4, 5]]) #vec is a 3x1 matrix (or a column vector) vec = scipy.sparse.csc_matrix([1,2,3]).T """ I want to subtract `vec` from each of the columns in `mat` yielding... [[0, 1, 2], [0, 1, 2], [0, 1, 2]] """ 

One way to accomplish what I want is to hstack vec for myself 3 times, which gives a 3x3 matrix, where each column is vec , and then subtracts from mat . But then again, I'm looking for a way to do this efficiently, and the hstacked matrix takes a long time to create. I'm sure there is a magical way to do this with slicing and streaming, but this eludes me.

Thanks!

EDIT: Removed the 'in-place' constraint, as the sparseness structure will constantly change in the location assignment scenario.

+6
source share
3 answers

For starters, what would we do with dense arrays?

 mat-vec.A # taking advantage of broadcasting mat-vec.A[:,[0]*3] # explicit broadcasting mat-vec[:,[0,0,0]] # that also works with csr matrix 

At https://codereview.stackexchange.com/questions/32664/numpy-scipy-optimization/33566 we found that using as_strided in the as_strided vector is the most efficient way to jump through the rows of a sparse matrix. ( x.rows , x.cols of lil_matrix almost as good. getrow slow). This function implements operations such as iteration.

 def sum(X,v): rows, cols = X.shape row_start_stop = as_strided(X.indptr, shape=(rows, 2), strides=2*X.indptr.strides) for row, (start, stop) in enumerate(row_start_stop): data = X.data[start:stop] data -= v[row] sum(mat, vec.A) print mat.A 

I use vec.A for simplicity. If we keep vec sparse, we need to add a test for a non-zero value in row . Also this type of iteration only modifies nonzero mat elements. 0's do not change.

I suspect that the benefits of time will largely depend on the sparseness of the matrix and the vector. If vec has many zeros, it makes sense to iterate, changing only those mat lines where vec is nonzero. But vec is almost dense, as in this example, it can be difficult to beat mat-vec.A .

+6
source

Summary

So, if you use CSR instead of CSC, this is single-line:

 mat.data -= numpy.repeat(vec.toarray()[0], numpy.diff(mat.indptr)) 

Explanation

If you realize this, it’s better to do it differently, since we will subtract the same number from each row. In your example, then: subtract 1 from the first row, 2 from the second row, 3 from the third row.

I really came across this in a real application where I want to classify documents, each of which is represented as a row in a matrix, and the columns represent words. Each document has a rating, which should be multiplied by the rating of each word in this document. Using the sparse matrix row representation, I did something similar to this (I changed my code to answer your question):

 mat = scipy.sparse.csc_matrix([[1, 2, 3], [2, 3, 4], [3, 4, 5]]) #vec is a 3x1 matrix (or a column vector) vec = scipy.sparse.csc_matrix([1,2,3]).T # Use the row version mat_row = mat.tocsr() vec_row = vec.T # mat_row.data contains the values in a 1d array, one-by-one from top left to bottom right in row-wise traversal. # mat_row.indptr (an n+1 element array) contains the pointer to each first row in the data, and also to the end of the mat_row.data array # By taking the difference, we basically repeat each element in the row vector to match the number of non-zero elements in each row mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr)) print mat_row.todense() 

Result:

  [[0 1 2]
  [0 1 2]
  [0 1 2]]

The visualization looks something like this:

 >>> mat_row.data [1 2 3 2 3 4 3 4 5] >>> mat_row.indptr [0 3 6 9] >>> numpy.diff(mat_row.indptr) [3 3 3] >>> numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr)) [1 1 1 2 2 2 3 3 3] >>> mat_row.data -= numpy.repeat(vec_row.toarray()[0],numpy.diff(mat_row.indptr)) [0 1 2 0 1 2 0 1 2] >>> mat_row.todense() [[0 1 2] [0 1 2] [0 1 2]] 
+2
source

You can enter fake sizes by changing the strides your vector. You can “transform” your vector into a 3 x 3 matrix without additional emphasis using np.lib.stride_tricks.as_strided . On this page there is an example and a bit of discussion of this issue, as well as a discussion of related topics (eg, views). Find the Example: Fake Measurement with Steps page.

There are also quite a few examples on SO about this ... but my search skills are not giving me now.

0
source

Source: https://habr.com/ru/post/958348/


All Articles