Conditionally remove columns from a sparse matrix

To extract the important features from a sparse matrix, I want to remove the columns whose average score is below a certain threshold. Given the following example:

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(counts)
print(tfidf.toarray())

Now I calculate the average score of each feature:

summarizer_mean = lambda x: np.mean(x, axis=0)

print(summarizer_mean(tfidf))

The resulting means:

[[ 0.81236766  0.14681658  0.2311266 ]]

How can I remove those columns whose average score is less than the threshold, say 0.23 in my case?

1 answer

Make counts a NumPy array so that its columns can be selected with a boolean mask:

counts = np.array(
         [[3, 0, 1],
         [2, 0, 0],
         [3, 0, 0],
         [4, 0, 0],
         [3, 2, 0],
         [3, 0, 2]])

Then you can simply do this:

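threshold = 0.23
# np.mean on a sparse matrix gives a 1 x n result; squeeze it to a 1-D array
means = np.array(summarizer_mean(tfidf)).squeeze()
# keep only the columns whose mean exceeds the threshold
counts[:, means > threshold]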
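
If the goal is to drop the columns from the TF-IDF matrix itself rather than from counts, the same kind of mask can be applied to the sparse matrix directly. The following is only a minimal sketch under a few assumptions: the names keep and tfidf_filtered are illustrative and not from the question, and >= is used so that a column whose mean equals the threshold is kept, since the question asks to remove columns whose mean is less than the threshold.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

# fit_transform returns a SciPy sparse matrix
tfidf = TfidfTransformer(smooth_idf=False).fit_transform(counts)

threshold = 0.23

# Column means of the sparse matrix, flattened to a 1-D array
means = np.asarray(tfidf.mean(axis=0)).ravel()

# Integer indices of the columns to keep (illustrative name)
keep = np.flatnonzero(means >= threshold)

# Indexing with an integer array keeps the result sparse
tfidf_filtered = tfidf[:, keep]

print(keep)
print(tfidf_filtered.shape)
```

With the example data, only the second column (mean 0.14681658) falls below 0.23, so the filtered matrix keeps columns 0 and 2 and is never converted to a dense array.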

Source: https://habr.com/ru/post/1694454/

