How to use word_tokenize in a data frame

I recently started using the nltk module for text analysis. I'm stuck at one point: I want to use word_tokenize on a data frame, so that I get all the words used in a particular row of the data frame.

data example:

1. This is a very good site. I will recommend it to others.
2. Can you please give me a call at 9983938428. have issues with the listings.
3. good work! keep it up
4. not a very helpful site in finding home decor.

expected output:

1. 'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2. 'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3. 'good','work','!','keep','it','up'
4. 'not','a','very','helpful','site','in','finding','home','decor'

Basically, I want to separate all the words and find the length of each text in the data frame.

I know that word_tokenize can be used on a single string, but how do I apply it to the entire data frame?

Please, help!

Thanks in advance...

+5
2 answers

You can use the DataFrame apply method:

    import pandas as pd
    import nltk

    df = pd.DataFrame({'sentences': ['This is a very good site. I will recommend it to others.',
                                     'Can you please give me a call at 9983938428. have issues with the listings.',
                                     'good work! keep it up']})
    df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)

Output:

    >>> df
                                               sentences  \
    0  This is a very good site. I will recommend it ...
    1  Can you please give me a call at 9983938428. h...
    2                              good work! keep it up

                                         tokenized_sents
    0  [This, is, a, very, good, site, ., I, will, re...
    1  [Can, you, please, give, me, a, call, at, 9983...
    2                      [good, work, !, keep, it, up]

To find the length of each text, use apply with a lambda again:

    df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

    >>> df
                                               sentences  \
    0  This is a very good site. I will recommend it ...
    1  Can you please give me a call at 9983938428. h...
    2                              good work! keep it up

                                         tokenized_sents  sents_length
    0  [This, is, a, very, good, site, ., I, will, re...            14
    1  [Can, you, please, give, me, a, call, at, 9983...            15
    2                      [good, work, !, keep, it, up]             6
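As a side note, once the tokens are stored as lists in a column, the lengths can also be computed directly from that Series instead of row by row. A minimal sketch, assuming the tokenized_sents column produced above already exists:

    # Assumes df['tokenized_sents'] holds lists of tokens, as produced above.
    # Applying len element-wise over the Series avoids iterating whole rows.
    df['sents_length'] = df['tokenized_sents'].apply(len)

    # Series.str.len() also accepts list-valued entries and gives the same counts.
    df['sents_length_alt'] = df['tokenized_sents'].str.len()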
+9

pandas.Series.apply is faster than pandas.DataFrame.apply

    import time

    import pandas as pd
    import nltk

    df = pd.read_csv("/path/to/file.csv")

    start = time.time()
    df["unigrams"] = df["verbatim"].apply(nltk.word_tokenize)
    print("series.apply", time.time() - start)

    start = time.time()
    df["unigrams2"] = df.apply(lambda row: nltk.word_tokenize(row["verbatim"]), axis=1)
    print("dataframe.apply", time.time() - start)

On a sample 125 MB CSV file:

series.apply 144.428858995

dataframe.apply 201.884778976

Edit: You might think that the DataFrame df is larger after series.apply(nltk.word_tokenize), which could inflate the execution time of the subsequent dataframe.apply(nltk.word_tokenize) operation.

Pandas optimizes under the hood for such a scenario. I got a similar runtime of about 200 seconds by running dataframe.apply(nltk.word_tokenize) separately.
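To rule out that interaction entirely, each variant can be timed against its own fresh copy of the data, so neither run sees columns added by the other. A minimal sketch, reusing the hypothetical file path and verbatim column from the benchmark above:

    import time

    import pandas as pd
    import nltk

    df = pd.read_csv("/path/to/file.csv")  # hypothetical path, as in the benchmark above

    # Time Series.apply on an independent copy of the data.
    df_series = df.copy()
    start = time.time()
    df_series["unigrams"] = df_series["verbatim"].apply(nltk.word_tokenize)
    print("series.apply", time.time() - start)

    # Time DataFrame.apply on another independent copy, so the comparison
    # is unaffected by the extra column created in the first run.
    df_frame = df.copy()
    start = time.time()
    df_frame["unigrams"] = df_frame.apply(lambda row: nltk.word_tokenize(row["verbatim"]), axis=1)
    print("dataframe.apply", time.time() - start)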

+10

Source: https://habr.com/ru/post/1233592/

