Logical not on a scipy sparse matrix

I have a bag-of-words representation of a corpus stored in a D by W sparse matrix word_freqs. Each row is a document, and each column is a word. The element word_freqs[d,w] represents the number of occurrences of word w in document d.

I am trying to obtain another D by W matrix, not_word_occs, where, for each element of word_freqs:

  • If word_freqs[d,w] is zero, not_word_occs[d,w] should be one.
  • Otherwise, not_word_occs[d,w] should be zero.

In the end, this matrix will need to be multiplied by other matrices, which can be dense or sparse.


I tried several methods, including:

not_word_occs = (word_freqs == 0).astype(int)

This works for toy examples, but leads to a MemoryError for my actual data (about 18,000 x 16,000).

I also tried np.logical_not():

word_occs = sklearn.preprocessing.binarize(word_freqs)
not_word_occs = np.logical_not(word_freqs).astype(int)

This seemed promising, but np.logical_not() does not work on sparse matrices and gives the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

Any ideas or recommendations would be appreciated.

(By the way, word_freqs is generated using sklearn's preprocessing.CountVectorizer(). If there is a solution that involves converting this to another kind of matrix, I am certainly open to that.)

+4
3 answers

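The complement of a sparse matrix is dense. If you try to represent it with standard numpy arrays, you will not get far. For example: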

>>> import numpy as np
>>> a = []
>>> for j in range(100):
...     print(j)
...     a.append(np.ones((16000, 18000), dtype=int))

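This fails at j = 1. So, without a lot more memory, even if you build the complement directly (

>>> compl = np.ones(S.shape,int)
>>> compl[S.nonzero()] = 0

) you will not get very far with it.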

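However, the complement can be written as C = B1 - A, where B1 is the all-ones matrix of the same shape and A is the binarized sparse matrix. A product such as XC can then be computed as XB1 - XA: A is sparse, and multiplying by B1 only requires row sums, so both terms are cheap. In other words, there is no need to ever form C.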
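The identity XC = XB1 - XA can be sketched as follows. This is a minimal illustration, assuming X is a dense array and A is a scipy sparse matrix; the function name times_complement is hypothetical and not part of any library.

```python
import numpy as np
from scipy import sparse


def times_complement(X, A):
    """Return X @ C for C = 1 - (A != 0), without ever building C.

    X is a dense (m, D) array and A is a sparse (D, W) matrix.
    """
    X = np.asarray(X)
    # Binarize A; only the stored nonzeros become 1, so it stays sparse.
    A_bin = (sparse.csr_matrix(A) != 0).astype(np.int64)
    # X @ B1: every entry in row i equals the sum of row i of X.
    X_B1 = X.sum(axis=1, keepdims=True)   # shape (m, 1), broadcasts over W
    # X @ A_bin, written as (A_bin.T @ X.T).T so the sparse matrix is on the left.
    X_A = np.asarray(A_bin.T @ X.T).T     # shape (m, W)
    return X_B1 - X_A
```

The result is a dense (m, W) array, which is only practical when m is small. The point of the sketch is that the dense D by W complement itself is never allocated.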


+1

Make a small random sparse matrix:

In [743]: freq = sparse.random(10,10,.1)
In [744]: freq
Out[744]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in COOrdinate format>

repr(freq) shows the shape, the dtype, the number of stored elements, and the format.

In [745]: freq==0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
  ", try using != instead.", SparseEfficiencyWarning)
Out[745]: 
<10x10 sparse matrix of type '<class 'numpy.bool_'>'
    with 90 stored elements in Compressed Sparse Row format>

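With your first test I get a warning, and the result stores 90 (out of 100) elements. The not of a sparse matrix is no longer sparse.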

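In general, numpy functions do not work on sparse matrices. To work, they would have to delegate to the sparse matrix's own methods. And even if logical_not did work, it would not solve the memory problem.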
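To see why memory is the real obstacle, the size of the complement can be estimated without building it. The sketch below assumes CSR storage with 8-byte values and 4-byte indices; both helper names are hypothetical.

```python
import numpy as np
from scipy import sparse


def complement_nnz(A):
    """Number of entries a sparse 'logical not' of A would have to store."""
    A = sparse.csr_matrix(A, copy=True)   # copy so the caller's matrix is untouched
    A.eliminate_zeros()                   # ignore explicitly stored zeros
    total = A.shape[0] * A.shape[1]
    return total - A.nnz


def csr_bytes_estimate(nnz, n_rows, value_bytes=8, index_bytes=4):
    """Rough CSR footprint: values + column indices + row pointer."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes
```

For an 18,000 x 16,000 matrix there are 288,000,000 entries in total. If nearly all of them end up stored, this estimate comes to as much as about 3.5 GB, which is more than the roughly 2.3 GB a dense int64 array of the same shape would take.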

0

Here is a small demo using pandas.SparseDataFrame:

In [42]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)

In [44]: d1 = pd.SparseDataFrame(X.toarray(), default_fill_value=0, dtype=np.int64)

In [45]: d2 = pd.SparseDataFrame(np.ones((10,10)), default_fill_value=1, dtype=np.int64)

In [46]: d1.memory_usage()
Out[46]:
Index    80
0        16
1         0
2         8
3        16
4         0
5         0
6        16
7        16
8         8
9         0
dtype: int64

In [47]: d2.memory_usage()
Out[47]:
Index    80
0         0
1         0
2         0
3         0
4         0
5         0
6         0
7         0
8         0
9         0
dtype: int64

The complement:

In [48]: d2 - d1
Out[48]:
   0  1  2  3  4  5  6  7  8  9
0  1  1  0  0  1  1  0  1  1  1
1  1  1  1  1  1  1  1  1  0  1
2  1  1  1  1  1  1  1  1  1  1
3  1  1  1  1  1  1  1  0  1  1
4  1  1  1  1  1  1  1  1  1  1
5  0  1  1  1  1  1  1  1  1  1
6  1  1  1  1  1  1  1  1  1  1
7  0  1  1  0  1  1  1  0  1  1
8  1  1  1  1  1  1  0  1  1  1
9  1  1  1  1  1  1  1  1  1  1

The source matrix:

In [49]: d1
Out[49]:
   0  1  2  3  4  5  6  7  8  9
0  0  0  1  1  0  0  1  0  0  0
1  0  0  0  0  0  0  0  0  1  0
2  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  1  0  0
4  0  0  0  0  0  0  0  0  0  0
5  1  0  0  0  0  0  0  0  0  0
6  0  0  0  0  0  0  0  0  0  0
7  1  0  0  1  0  0  0  1  0  0
8  0  0  0  0  0  0  1  0  0  0
9  0  0  0  0  0  0  0  0  0  0

Memory usage of the result:

In [50]: (d2 - d1).memory_usage()
Out[50]:
Index    80
0        16
1         0
2         8
3        16
4         0
5         0
6        16
7        16
8         8
9         0
dtype: int64

PS, SparseDataFrame (- ),

0

Source: https://habr.com/ru/post/1668885/

