I have a bag-of-words representation of a corpus stored in a D by W sparse matrix word_freqs. Each row is a document, and each column is a word. The element word_freqs[d,w] is the number of occurrences of word w in document d.
I am trying to get another D by W matrix, not_word_occs, where for each element of word_freqs:
- If word_freqs[d,w] is zero, not_word_occs[d,w] should be one.
- Otherwise, not_word_occs[d,w] should be zero.
In the end, this matrix will need to be multiplied by other matrices, which can be dense or sparse.
I tried several methods, including:
not_word_occs = (word_freqs == 0).astype(int)
This works for toy examples, but raises a MemoryError on my actual data (approximately 18,000 x 16,000).
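For scale, here is a rough estimate of what a dense result of that shape would need. The shape is the approximate one quoted above, and the dtypes are assumptions (int64 is what astype(int) gives on most 64-bit Linux and macOS setups). Since most entries of word_freqs are zero, most entries of the complement are one, so a sparse format is unlikely to help and may cost more, because it also stores indices.

```python
import numpy as np

# Approximate shape quoted in the question.
D, W = 18_000, 16_000
n_cells = D * W  # total number of entries in a dense D x W matrix

# Dense storage cost for two candidate dtypes (assumed, not measured).
gib_int64 = n_cells * np.dtype(np.int64).itemsize / 2**30
gib_int8 = n_cells * np.dtype(np.int8).itemsize / 2**30

print(f"{n_cells:,} entries")
print(f"int64: {gib_int64:.2f} GiB, int8: {gib_int8:.2f} GiB")
```

A smaller dtype reduces the footprint, but any later product with a float matrix would likely upcast it again.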
I also tried np.logical_not():
word_occs = sklearn.preprocessing.binarize(word_freqs)
not_word_occs = np.logical_not(word_freqs).astype(int)
This seemed promising, but np.logical_not() does not work on sparse matrices, and gives the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
Any ideas or recommendations would be appreciated.
(By the way, word_freqs is generated using sklearn.feature_extraction.text.CountVectorizer(). If there is a solution that involves converting it to some other kind of matrix, I am certainly open to that.)
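One possible direction, if only the products with other matrices are needed and not not_word_occs itself: the complement equals an all-ones matrix minus the 0/1 occurrence indicator, so its product with another matrix can be formed from column sums and one sparse product. The sketch below is a minimal illustration under that assumption. The helper name complement_times is hypothetical, and the result is a dense D x K array, so this only helps when the other matrix has few columns.

```python
import numpy as np
from scipy import sparse

def complement_times(word_freqs, other):
    """Return (word_freqs == 0) @ other without building the dense complement.

    Hypothetical helper: uses (ones - occs) @ other = ones @ other - occs @ other.
    """
    # Sparse 0/1 indicator of the nonzero entries; stays as sparse as word_freqs.
    occs = (word_freqs != 0).astype(np.int64)
    # Every row of ones(D, W) @ other is the vector of column sums of other.
    col_sums = np.asarray(other.sum(axis=0)).ravel()
    occ_part = occs @ other
    if sparse.issparse(occ_part):
        occ_part = occ_part.toarray()
    # Broadcast the column sums over all D rows and subtract the occurrence part.
    return col_sums[np.newaxis, :] - np.asarray(occ_part)

# Toy check against the direct dense computation.
word_freqs = sparse.csr_matrix(np.array([[2, 0, 1], [0, 0, 3]]))
other = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
result = complement_times(word_freqs, other)
expected = (word_freqs.toarray() == 0).astype(int) @ other
```

The same helper accepts a sparse right-hand matrix, since only its column sums and one sparse product are used.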