Removing stop words from a file

Question


I want to remove stop words from the Data column in my file. I have already filtered the rows down to those where the end user is speaking, but usertext.apply(lambda x: [word for word in x if word not in stop_words]) does not filter out the stop words. What am I doing wrong?

    import pandas as pd
    from stop_words import get_stop_words

    df = pd.read_csv("F:/textclustering/data/cleandata.csv", encoding="iso-8859-1")
    usertext = df[df.Role.str.contains("End-user", na=False)][['Data', 'chatid']]
    stop_words = get_stop_words('dutch')
    clean = usertext.apply(lambda x: [word for word in x if word not in stop_words])
    print(clean)

+5

python pandas csv

DataNewB Mar 08 '17 at 14:33


3 answers

Edchum · Answer 1 · 2017-03-08T14:55:42+0000

You can create a regex pattern for your stop words and call the vectorized str.replace to remove them:

    In [124]:
    stop_words = ['a', 'not', 'the']
    stop_words_pat = '|'.join(['\\b' + stop + '\\b' for stop in stop_words])
    stop_words_pat

    Out[124]: '\\ba\\b|\\bnot\\b|\\bthe\\b'

    In [125]:
    df = pd.DataFrame({'text': ['a to the b', 'the knot ace a']})
    # regex=True is needed on pandas 2.0+, where str.replace matches literally by default
    df['text'].str.replace(stop_words_pat, '', regex=True)

    Out[125]:
    0    to b
    1    knot ace
    Name: text, dtype: object

Here a list comprehension wraps each stop word in '\b', which matches a word boundary, and the resulting patterns are then joined with '|' so that any one of the words matches.
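The pattern-building step can also be tried without pandas. The following is a minimal sketch using only the standard `re` module; the helper name `strip_stop_words` is made up for illustration, and wrapping each word in `re.escape` plus collapsing leftover whitespace are additions that are not part of the answer above.

```python
import re

# Same illustrative stop words as in the answer above.
stop_words = ["a", "not", "the"]

# re.escape keeps any regex metacharacters in a stop word literal;
# \b anchors the alternation at word boundaries on both sides.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b")


def strip_stop_words(text):
    """Remove stop words, then collapse the whitespace they leave behind."""
    without_stops = pattern.sub("", text)
    return " ".join(without_stops.split())


print(strip_stop_words("a to the b"))      # to b
print(strip_stop_words("the knot ace a"))  # knot ace
```

Note that "knot" survives even though it contains "not", because there is no word boundary between "k" and "n". With pandas, the same pattern string (`pattern.pattern`) could be passed to `Series.str.replace` with `regex=True`.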

Austin hastings · Answer 2 · 2017-03-08T15:10:39+0000

Two issues:

First, you have a module called stop_words, and you later create a variable that is also called stop_words. This is bad form.

Second, the lambda you pass to .apply treats its parameter x as a list to iterate over, rather than as a single value inside a list.

That is, instead of doing df.apply(sqrt), you are doing df.apply(lambda x: [sqrt(val) for val in x]).
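To illustrate the mismatch in granularity, here is a plain-Python sketch. The list `column` is a hypothetical stand-in for the Data column, and the three stop words are made-up sample values rather than the output of get_stop_words.

```python
stop_words = ["de", "het", "een"]  # hypothetical Dutch stop words

# Hypothetical stand-in for the Data column: each element is a whole sentence.
column = ["ik heb een vraag", "het werkt niet"]

# Iterating over the column compares entire sentences with the stop words,
# so every sentence is kept unchanged.
per_cell = [cell for cell in column if cell not in stop_words]
print(per_cell == column)  # True

# Splitting each sentence first compares individual words instead.
per_word = [[word for word in cell.split() if word not in stop_words]
            for cell in column]
print(per_word)  # [['ik', 'heb', 'vraag'], ['werkt', 'niet']]
```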

You should either do the list processing yourself:

 clean = [x for x in usertext if x not in stop_words]

Or you should use apply with a function that takes one word at a time:

 clean = usertext.apply(lambda x: x if x not in stop_words else '')

As @Jean-François Fabre explained in a comment, you can speed things up if your stop words are stored in a set rather than a list:

    from stop_words import get_stop_words

    nl_stop_words = set(get_stop_words('dutch'))  # NOTE: set
    usertext = ...
    clean = usertext.apply(lambda word: word if word not in nl_stop_words else '')
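The snippets above assume each value is a single word. If each cell of the Data column holds a whole sentence, the text would need to be split into words first. A minimal sketch under that assumption follows; the helper name `remove_stop_words`, the whitespace tokenization, the lower-casing, and the sample stop words are all illustrative choices, not part of the original answer.

```python
def remove_stop_words(text, stop_words):
    """Return text with every word found in stop_words removed."""
    # Split on whitespace and compare lower-cased words against the set.
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


# Hypothetical stand-in for set(get_stop_words('dutch')).
nl_stop_words = {"de", "het", "een", "ik"}

print(remove_stop_words("Ik heb een vraag over de factuur", nl_stop_words))
# heb vraag over factuur
```

With pandas, this could be applied per cell as usertext['Data'].apply(lambda text: remove_stop_words(text, nl_stop_words)), assuming every cell is a string. Punctuation attached to a word (for example "de,") is not handled by this simple split.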

galaxyan · Answer 3 · 2017-03-08T14:40:22+0000

 clean = usertext.apply(lambda x: x if x not in stop_words else '')
