Remove conditional columns from sparse matrix

Question

To extract the important features from a sparse matrix, I want to remove the columns whose average score is below a certain threshold. Given the following example:

import numpy as np
counts = [[3, 0, 1],
           [2, 0, 0],
           [3, 0, 0],
           [4, 0, 0],
           [3, 2, 0],
           [3, 0, 2]]
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(counts)
print(tfidf.toarray())

Now we calculate the average score of each feature (column):

summarizer_mean = lambda x: np.mean(x, axis=0)

print(summarizer_mean(tfidf))

The resulting averages:

[[ 0.81236766  0.14681658  0.2311266 ]]

How can I remove those columns whose average score is less than the threshold, say 0.23 in my case?

python-3.x sparse-matrix tfidfvectorizer

Zeeshan nisar Mar 05 '18 at 7:23

1 answer

kaidokuuppa · Accepted Answer · 2018-03-05T08:58:51+0000

You should use a NumPy array for counts, since a plain list of lists does not support boolean column indexing:

counts = np.array(
         [[3, 0, 1],
         [2, 0, 0],
         [3, 0, 0],
         [4, 0, 0],
         [3, 2, 0],
         [3, 0, 2]])

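Then you can select the columns with a boolean mask:

threshold = 0.23
# summarizer_mean returns a 1 x n result for the sparse input; flatten it to 1-D
means = np.array(summarizer_mean(tfidf)).squeeze()
# keep only the columns whose mean TF-IDF score exceeds the threshold
counts[:, means > threshold]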
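
If the goal is to filter the sparse TF-IDF matrix itself rather than the raw counts, a minimal sketch along the same lines might look like the following. The helper name drop_low_mean_columns and the small demo matrix are made up for illustration and are not part of the original answer. It assumes a SciPy sparse input and keeps columns whose mean is at least the threshold, whereas the snippet above uses a strict >.

```python
import numpy as np
from scipy import sparse


def drop_low_mean_columns(matrix, threshold):
    """Hypothetical helper: drop columns whose mean is below `threshold`.

    Returns the filtered sparse matrix and the indices of the kept columns.
    """
    # The column means may come back as a 1 x n matrix; flatten to a 1-D array.
    means = np.asarray(matrix.mean(axis=0)).ravel()
    # Integer indices of the columns that pass the threshold.
    keep = np.flatnonzero(means >= threshold)
    # Column selection is cheap in CSC format; convert back to CSR afterwards.
    return matrix.tocsc()[:, keep].tocsr(), keep


# Illustrative data only (not the TF-IDF values from the question).
demo = sparse.csr_matrix(np.array([[0.8, 0.0, 0.6],
                                   [1.0, 0.0, 0.0],
                                   [0.5, 0.9, 0.0]]))
filtered, kept = drop_low_mean_columns(demo, threshold=0.25)
print(kept)
print(filtered.toarray())
```

With the question's data, this would be called as drop_low_mean_columns(tfidf, 0.23). The result stays sparse, so there is no need to call toarray() on a large matrix just to filter it.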
