Question. What is the best way to convert the sparse matrices produced by sklearn's CountVectorizer and TfidfTransformer into a Pandas DataFrame with a separate row for each bigram and its corresponding frequency and tf-idf score?
Pipeline: Pull text data from the SQL database, split the text into bigrams, calculate the per-document frequency and the tf-idf score of each bigram, and load the results back into the SQL DB.
Current state:
Two columns of data (number, text) are loaded. The text column is cleaned to create a third column, cleanText:
   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna
This DataFrame is then fed into sklearn's feature extraction:
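from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Count word bigrams only (ngram_range=(2, 2)) for each document
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2, 2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)
# Reweight the raw counts into tf-idf scores
tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)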
The matrices are then converted to arrays and added to the DataFrame:
data['frequency'] = list(dt_mat.toarray())
data['tfidf_score'] = list(tfidf_mat.toarray())
Output:
   number                               text              cleanText  \
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna
               frequency                                        tfidf_score
0  [0, 1, 0, 0, 0, 1, 0]  [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...
1  [0, 0, 1, 0, 1, 0, 1]  [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...
2  [1, 0, 0, 1, 0, 0, 0]  [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0...
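Each position in these arrays corresponds to one bigram in the vectorizer's vocabulary. As a minimal sketch (the `docs` list and the `columns` name are introduced here only for illustration, and the texts are copied from the cleanText column above), the fitted `vocabulary_` attribute can be used to see which bigram belongs to which column:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative copy of the cleanText column shown above
docs = ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"]

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2), analyzer="word")
dt_mat = cv.fit_transform(docs)

# vocabulary_ maps each bigram to its column index in the matrix;
# sorting by that index yields the bigrams in column order
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
for idx, bigram in enumerate(columns):
    print(idx, bigram)
```

With this ordering, a non-zero value at position `i` of a document's array refers to `columns[i]`.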
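Problem:
Both matrices are document-term matrices (rows are documents, columns are bigrams), so each cell of frequency and tfidf_score holds a whole array rather than a single value.
Desired output: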
   number          bigram  frequency  tfidf_score
0     123   farmer plants          1         0.70
0     123    plants grain          1         0.56
1     234      farmer son          1         0.72
1     234          son go          1         0.63
1     234      go fishing          1         0.34
2     345  fisher catches          1         0.43
2     345    catches tuna          1         0.43
So far, I have tried expanding the arrays into separate DataFrame rows:
data.reset_index(inplace=True)
rows = []
_ = data.apply(lambda row: [rows.append([row['number'], nn])
for nn in row.tfidf_score], axis=1)
df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])
Output:
number tfidf_score
0 123 0.000000
1 123 0.707107
2 123 0.000000
3 123 0.000000
4 123 0.000000
5 123 0.707107
6 123 0.000000
7 234 0.000000
8 234 0.000000
9 234 0.577350
10 234 0.000000
11 234 0.577350
12 234 0.000000
13 234 0.577350
14 345 0.707107
15 345 0.000000
16 345 0.000000
17 345 0.707107
18 345 0.000000
19 345 0.000000
20 345 0.000000
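One possible way to reach the desired shape without carrying the zero entries is to read the non-zero cells straight from the sparse matrix in COO format, which exposes them as parallel row, column, and value arrays. The following is only a sketch under stated assumptions: the small `data` frame is rebuilt from the example above, `long_df` and `bigrams` are names made up for illustration, and the tf-idf matrix is assumed to have the same non-zero positions as the count matrix (which holds with TfidfTransformer's default settings).

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Illustrative copy of the example data from the question
data = pd.DataFrame({
    "number": [123, 234, 345],
    "cleanText": ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"],
})

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2), analyzer="word")
dt_mat = cv.fit_transform(data["cleanText"])
tfidf_mat = TfidfTransformer().fit_transform(dt_mat)

# Bigrams in column order, taken from the fitted vocabulary
bigrams = np.array(sorted(cv.vocabulary_, key=cv.vocabulary_.get))

# COO format lists only the non-zero cells as (row, col, value) triples
coo = dt_mat.tocoo()

long_df = pd.DataFrame({
    "number": data["number"].to_numpy()[coo.row],   # document id for each cell
    "bigram": bigrams[coo.col],                     # bigram for each cell
    "frequency": coo.data,                          # raw count
    # look up the tf-idf value at the same (row, col) positions
    "tfidf_score": np.asarray(tfidf_mat[coo.row, coo.col]).ravel(),
})
print(long_df)
```

Because the tf-idf values are looked up by the same (row, col) pairs instead of being zipped positionally, the two matrices do not need to store their entries in the same internal order. The resulting frame can then be written back to the database, for example with `DataFrame.to_sql`.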