I have a long list (200,000+) of phrases:
phrase_list = ['some word', 'another example', ...]
And a two-column pandas DataFrame with a description in the first column and a score in the second:
Description                               Score
this sentence contains some word in it    6
some word is on my mind                   3
repeat another example of me              2
this sentence has no matches              100
another example with some word            10
There are 300,000 rows. For each phrase in the phrase list, I want the total score of all rows whose description contains that phrase. So for “some word” the total is 6 + 3 + 10 = 19, and for “another example” it is 2 + 10 = 12.
My current code works, but it is very slow:
phrase_score = []
for phrase in phrase_list:
    # Sum the scores of all rows whose description contains the phrase
    mask = df['Description'].str.contains(phrase)
    phrase_score.append([phrase, df.loc[mask, 'Score'].sum()])
I would like to end up with a pandas DataFrame with the phrase in one column and the score in the second (this part is trivial once I have a list of lists). However, I need a faster way to build that list of lists.
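One possible direction, as a minimal sketch rather than a definitive solution: instead of scanning every row once per phrase, walk each description once, generate its word n-grams, and look them up in a set of phrases. The function name phrase_scores is hypothetical, and the sketch makes assumptions that differ from str.contains: a phrase only matches as a run of whole, single-space-separated words (so 'some word' would not match 'awesome words'), matching is case-sensitive, and phrases are treated as literal text, not regular expressions. Each phrase is counted at most once per row.

```python
from collections import defaultdict

def phrase_scores(phrases, descriptions, scores):
    """Return [[phrase, total_score], ...] in the order of `phrases`.

    Assumption: a phrase matches only as a sequence of whole words.
    """
    phrase_set = set(phrases)
    # Only n-gram lengths that occur among the phrases need to be generated.
    lengths = {len(p.split()) for p in phrase_set if p.strip()}
    totals = defaultdict(int)

    for text, score in zip(descriptions, scores):
        words = text.split()
        found = set()  # phrases seen in this row, each counted once
        for n in lengths:
            for i in range(len(words) - n + 1):
                gram = " ".join(words[i:i + n])
                if gram in phrase_set:
                    found.add(gram)
        for phrase in found:
            totals[phrase] += score

    return [[p, totals[p]] for p in phrases]

# Sample data from the question
phrase_list = ['some word', 'another example']
descriptions = [
    'this sentence contains some word in it',
    'some word is on my mind',
    'repeat another example of me',
    'this sentence has no matches',
    'another example with some word',
]
scores = [6, 3, 2, 100, 10]

result = phrase_scores(phrase_list, descriptions, scores)
print(result)  # [['some word', 19], ['another example', 12]]
```

With the DataFrame from the question, the same function could be called as phrase_scores(phrase_list, df['Description'], df['Score']), and the returned list of lists passed to pd.DataFrame(result, columns=['Phrase', 'Score']), where the column name 'Phrase' is only an illustration. The work then scales with the total number of words in the descriptions times the number of distinct phrase lengths, rather than with the number of phrases times the number of rows.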