Efficiently assigning a row in a SciPy sparse.lil_matrix?

I have sparse vectors with a dimension of about 200,000. I also have a matrix with the same number of columns and as many rows as there are vectors. I want to fill the matrix incrementally, one vector per row: the first vector goes into the first row, and so on.

The matrix and vectors are currently of type scipy.sparse.lil_matrix. Vectors are set to a specific row of the matrix using the following function:

In [7]: us.get_utterance_representation('here is a sentence')
Out[7]:
<1x188796 sparse matrix of type '<type 'numpy.float64'>'
    with 22489 stored elements in Compressed Sparse Row format>

def set_row_vector(self, row, rowvector):
    self.matrix[row] = rowvector[0]

for row, utterance in enumerate(utterances):
    uvector = self.get_utterance_representation(utterance)
    self.utterancematrix.set_row_vector(row, uvector)

Here uvector is a lil_matrix of dimension 1 x ~200,000.

Building the matrix this way is extremely inefficient: a single line of text (an utterance) takes up to 5 seconds. From the profile, I concluded that the main cost is setting the vector as a row of the matrix.

     55     def set_row_vector(self, row, rowvector):
         2564609 function calls (2564606 primitive calls) in 5.046 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    22489    1.397    0.000    1.397    0.000 {numpy.core.multiarray.where}
    22489    0.783    0.000    2.188    0.000 csr.py:281(_get_single_element)
    44978    0.365    0.000    0.916    0.000 stride_tricks.py:35(broadcast_arrays)
    44978    0.258    0.000    0.413    0.000 stride_tricks.py:22(as_strided)
   202490    0.244    0.000    0.244    0.000 {numpy.core.multiarray.array}
    22489    0.199    0.000    2.221    0.000 lil.py:280(__setitem__)
    44978    0.174    0.000    0.399    0.000 sputils.py:171(_unpack_index)
   584777    0.171    0.000    0.171    0.000 {isinstance}
    44988    0.170    0.000    0.230    0.000 sputils.py:115(isintlike)
    67467    0.166    0.000    0.278    0.000 sputils.py:196(_check_boolean)
    22489    0.154    0.000    0.647    0.000 sputils.py:215(_index_to_arrays)
        1    0.129    0.129    5.035    5.035 dsm_classes.py:55(set_row_vector)
    22489    0.120    0.000    0.171    0.000 lil.py:247(_insertat2)
    67467    0.102    0.000    0.102    0.000 {method 'ravel' of 'numpy.ndarray' objects}
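A profile like the one above can be gathered with the standard cProfile and pstats modules. The following is only a minimal sketch: the matrix size, the random test row, and the free function set_row_vector are illustrative assumptions, not the original dsm_classes.py code.

```python
import cProfile
import io
import pstats

import numpy as np
import scipy.sparse as ss

# Assumed toy data: a small LIL matrix and one random 1 x 2000 CSR row.
rng = np.random.RandomState(0)
cols = rng.choice(2000, size=200, replace=False)
row = ss.csr_matrix(
    (rng.rand(200), (np.zeros(200, dtype=int), cols)), shape=(1, 2000)
)
matrix = ss.lil_matrix((3, 2000))


def set_row_vector(target, row_index, rowvector):
    # Hypothetical stand-in for the method in the question.
    target[row_index] = rowvector


# Profile a single row assignment.
profiler = cProfile.Profile()
profiler.enable()
set_row_vector(matrix, 0, row)
profiler.disable()

# Sort by internal time, as in the listing above, and keep the top 10 entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('tottime').print_stats(10)
report = stream.getvalue()
print(report)
```

The function names and timings in the report depend on the installed SciPy version, so they will not match the listing above line for line.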

My question: is there a more efficient way to build this matrix?


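First off, note that uvector is in CSR format, not LIL. That said, in my tests, converting it to LIL did not make the assignment any faster: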

In [30]: import scipy.sparse as ss

In [31]: row = ss.rand(1,5000,0.1,'csr')

In [32]: matrix = ss.lil_matrix((30,5000))

In [33]: %timeit matrix[0] = row
10 loops, best of 3: 65.6 ms per loop

In [34]: row_lil = row.tolil()

In [35]: %timeit matrix[0] = row_lil
10 loops, best of 3: 93.4 ms per loop

Also, you can drop the [0] index on rowvector, which saves some time:

In [38]: %timeit matrix[0] = row[0]
10 loops, best of 3: 104 ms per loop

In [39]: %timeit matrix[0] = row
10 loops, best of 3: 68.7 ms per loop
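To see why the [0] index is redundant, the sketch below checks that indexing a 1 x N CSR matrix with [0] gives back a matrix of the same shape and values, so the indexing only adds work. The small five-column row is an assumed example, not data from the question.

```python
import numpy as np
import scipy.sparse as ss

# Assumed toy row: a 1 x 5 CSR matrix with two stored values.
row = ss.csr_matrix(np.array([[0.0, 1.5, 0.0, 2.0, 0.0]]))

# Indexing the only row returns another 1 x 5 sparse matrix.
indexed = row[0]
same_shape = indexed.shape == row.shape
same_values = np.array_equal(indexed.toarray(), row.toarray())

# Assigning the row directly gives the same result as assigning row[0].
direct = ss.lil_matrix((3, 5))
direct[0] = row
via_index = ss.lil_matrix((3, 5))
via_index[0] = row[0]
same_result = np.array_equal(direct.toarray(), via_index.toarray())
```

This relies on the csr_matrix class used in the question; the newer sparse array classes may behave differently when indexed with a single integer.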

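Either way, row assignment into a LIL matrix is slow. If you have all the rows available, it is much faster to build the matrix in one go with scipy.sparse.vstack: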

In [40]: %%timeit
   ....: for i in xrange(matrix.shape[0]):
   ....:   matrix[i] = row
   ....:
1 loops, best of 3: 3.14 s per loop

In [41]: %timeit ss.vstack([row for i in xrange(matrix.shape[0])])
1000 loops, best of 3: 1.46 ms per loop

In [44]: m2 = ss.vstack([row for i in xrange(matrix.shape[0])])

In [45]: numpy.allclose(matrix.todense(), m2.todense())
Out[45]: True

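Note that this vstack call works on CSR matrices. For CSR blocks stacked along axis = 0, SciPy routes the call through _compressed_sparse_stack, so no explicit for loop over the rows is needed.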
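Applied to the loop in the question, the vstack approach might look like the sketch below. The function get_utterance_representation here is a hypothetical stand-in that returns a random 1 x n_features CSR row; the real method and the dimension of about 200,000 come from the asker's code, which is not shown in full.

```python
import numpy as np
import scipy.sparse as ss


def get_utterance_representation(utterance, n_features=5000):
    # Hypothetical stand-in: a deterministic random 1 x n_features CSR row.
    rng = np.random.RandomState(len(utterance))
    cols = rng.choice(n_features, size=50, replace=False)
    data = rng.rand(50)
    rows = np.zeros(50, dtype=int)
    return ss.csr_matrix((data, (rows, cols)), shape=(1, n_features))


def build_utterance_matrix(utterances, n_features=5000):
    # Collect all row vectors first, then stack them in a single call
    # instead of assigning them to a LIL matrix one row at a time.
    rows = [get_utterance_representation(u, n_features) for u in utterances]
    return ss.vstack(rows, format='csr')


utterances = ['here is a sentence', 'and another one', 'a third']
matrix = build_utterance_matrix(utterances)
```

The result is a CSR matrix. If later code needs to modify individual entries, it can be converted with matrix.tolil(), at the cost of one conversion rather than one slow assignment per row.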


Source: https://habr.com/ru/post/1524545/

