Convert CountVectorizer and TfidfTransformer sparse matrices into separate Pandas Dataframe rows

Question . What is the best way to convert sparse matrices obtained from sklearn CountVectorizer and TfidfTransformer into Pandas DataFrame columns with a separate row for each bigram and its corresponding frequency and tf-idf coefficient?

Pipeline: . Bring text data from the SQL database, split the text into bigrams and calculate the frequency for one document and tf-idf for each file for each document, load the results back into SQL DB.

Current state:

Two data columns ( number, text) are entered . textcleared to create the third column cleanText:

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna

This DataFrame is fed into the extraction of the sklearn function:

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

DataFrame :

data['frequency'] = list(dt_mat.toarray())
data['tfidf_score']=list(tfidf_mat.toarray())

:

   number                               text              cleanText  \
0     123            The farmer plants grain    farmer plants grain   
1     234  The farmer and his son go fishing  farmer son go fishing   
2     345            The fisher catches tuna    fisher catches tuna   

               frequency                                        tfidf_score  

0  [0, 1, 0, 0, 0, 1, 0]  [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...  
1  [0, 0, 1, 0, 1, 0, 1]  [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...  
2  [1, 0, 0, 1, 0, 0, 0]  [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0... 

:

  • (, bigrams) DataFrame
  • frequency tfidf_score

:

       number                    bigram         frequency      tfidf_score
0     123            farmer plants                 1              0.70  
0     123            plants grain                  1              0.56
1     234            farmer son                    1              0.72
1     234            son go                        1              0.63
1     234            go fishing                    1              0.34
2     345            fisher catches                1              0.43
2     345            catches tuna                  1              0.43

, DataFrame :

data.reset_index(inplace=True)
rows = []
_ = data.apply(lambda row: [rows.append([row['number'], nn]) 
                         for nn in row.tfidf_score], axis=1)
df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])

:

    number  tfidf_score
0      123     0.000000
1      123     0.707107
2      123     0.000000
3      123     0.000000
4      123     0.000000
5      123     0.707107
6      123     0.000000
7      234     0.000000
8      234     0.000000
9      234     0.577350
10     234     0.000000
11     234     0.577350
12     234     0.000000
13     234     0.577350
14     345     0.707107
15     345     0.000000
16     345     0.000000
17     345     0.707107
18     345     0.000000
19     345     0.000000
20     345     0.000000

, , ( ). , ( ), , , - , .

! , , - . , , - , .

+4
1

bigram CountVectorizer get_feature_names(). melt merge:

print(data)

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

CountVectorizer - , , bigrams:

print(cv.get_feature_names())

[u'catches tuna',
 u'farmer plants',
 u'farmer son',
 u'fisher catches',
 u'go fishing',
 u'plants grain',
 u'son go']

CountVectorizer.fit_transform() . , DataFrame, :

bigrams = pd.DataFrame(dt_mat.todense(), index=data.index, columns=cv.get_feature_names())
bigrams['number'] = data.number
print(bigrams)

   catches tuna  farmer plants  farmer son  fisher catches  go fishing  \
0             0              1           0               0           0   
1             0              0           1               0           1   
2             1              0           0               1           0   

   plants grain  son go  number  
0             1       0     123  
1             0       1     234  
2             0       0     345  

, melt().
bigram (query() ):

bigrams_long = (pd.melt(bigrams.reset_index(), 
                       id_vars=['index','number'],
                       value_name='bigram_ct')
                 .query('bigram_ct > 0')
                 .sort_values(['index','number']))

    index  number        variable  bigram_ct
3       0     123   farmer plants          1
15      0     123    plants grain          1
7       1     234      farmer son          1
13      1     234      go fishing          1
19      1     234          son go          1
2       2     345    catches tuna          1
11      2     345  fisher catches          1

tfidf:

tfidf = pd.DataFrame(tfidf_mat.todense(), index=data.index, columns=cv.get_feature_names())
tfidf['number'] = data.number

tfidf_long = pd.melt(tfidf.reset_index(), 
                     id_vars=['index','number'], 
                     value_name='tfidf').query('tfidf > 0')

, bigrams tfidf:

fulldf = (bigrams_long.merge(tfidf_long, 
                             on=['index','number','variable'])
                      .set_index('index'))

       number        variable  bigram_ct     tfidf
index                                             
0         123   farmer plants          1  0.707107
0         123    plants grain          1  0.707107
1         234      farmer son          1  0.577350
1         234      go fishing          1  0.577350
1         234          son go          1  0.577350
2         345    catches tuna          1  0.707107
2         345  fisher catches          1  0.707107
+2

Source: https://habr.com/ru/post/1677083/


All Articles