Converting a co-filtering code to use sparse matrices

I am puzzled by the following problem: given two dense matrices X (m by l) and Theta (n by l) and a sparse matrix R (m by n), is there a fast way to calculate the sparse inner product, i.e. X.dot(Theta.T) evaluated only at the nonzero entries of R? The large sizes are m and n (order 100000), while l is small (order 10). This is probably a fairly common operation for big data, since it appears in the cost function of most problems involving linear regression, so I would expect a solution built into scipy.sparse, but I have not found anything obvious so far.
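For concreteness, here is a scaled-down sketch of the setup; the sizes are made up (the real m and n are ~100000 and l is ~10), and I treat R as a 0/1 indicator of the entries I care about, as is usual in collaborative filtering:

    import numpy as np
    import scipy.sparse as sp

    # Hypothetical toy sizes; the real problem has m, n ~ 100000 and l ~ 10.
    m, n, l = 1000, 800, 10
    X = np.random.rand(m, l)                          # dense, m by l
    Theta = np.random.rand(n, l)                      # dense, n by l
    R = sp.random(m, n, density=0.01, format='csr')   # sparse, m by n
    R.data[:] = 1.0                                   # 0/1 mask of observed entries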
The naive way to do this in python is R.multiply(X.dot(Theta.T)), but this evaluates the full dense matrix X.dot(Theta.T) (m by n, order 100000 ** 2), which takes up far too much memory, and then throws away most of the entries because R is sparse.
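Spelled out with the toy data above, the naive version is roughly:

    # Forms the full dense m-by-n product before masking with R --
    # infeasible at m, n ~ 100000.
    P_naive = R.multiply(X.dot(Theta.T))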
There is a pseudo-solution at https://stackoverflow.com/a/146505/ ... but it is not actually sparse in the intermediate step:
    def sparse_mult_notreally(a, b, coords):
        rows, cols = coords
        rows, r_idx = np.unique(rows, return_inverse=True)
        cols, c_idx = np.unique(cols, return_inverse=True)
        # Dense dot product over the unique rows/columns -- not actually sparse.
        C = np.array(np.dot(a[rows, :], b[:, cols]))
        # Assemble the sparse result at the requested coordinates.
        return sp.coo_matrix((C[r_idx, c_idx], coords), shape=(a.shape[0], b.shape[1]))
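I call it at the coordinates of R's nonzero entries, so in the notation above a = X and b = Theta.T:

    coords = R.nonzero()   # (row_indices, col_indices) of the nonzero entries of R
    P1 = sparse_mult_notreally(X, Theta.T, coords)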
This works fine, and fast, for me on fairly small arrays, but it blows up on my large datasets with the following error:
    ... in sparse_mult(a, b, coords)
        132         rows, r_idx = np.unique(rows, return_inverse=True)
        133         cols, c_idx = np.unique(cols, return_inverse=True)
    --> 134         C = np.array(np.dot(a[rows, :], b[:, cols]))
A solution that is actually sparse but very slow:
    def sparse_mult(a, b, coords):
        rows, cols = coords
        n = len(rows)
        # One explicit inner product per nonzero coordinate -- sparse, but a slow Python loop.
        C = np.array([float(np.dot(a[rows[i], :], b[:, cols[i]])) for i in range(n)])
        return sp.coo_matrix((C, coords), shape=(a.shape[0], b.shape[1]))
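As a sanity check (not a fix), on the toy data both helpers agree with a dense reference that is only feasible at these small sizes:

    coords = R.nonzero()
    P2 = sparse_mult(X, Theta.T, coords)

    # Dense reference: only computable on the scaled-down toy data, and only
    # equivalent because R is a 0/1 mask here.
    ref = R.toarray() * X.dot(Theta.T)
    print(np.allclose(P1.toarray(), ref), np.allclose(P2.toarray(), ref))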
Does anyone know a quick, completely sparse way to do this?