Run nltk sent_tokenize via Pandas dataframe

I have a DataFrame with two columns, ID and TEXT. Here are the data:

 ID   TEXT
 265  The farmer plants grain. The fisher catches tuna.
 456  The sky is blue.
 434  The sun is bright.
 921  I own a phone. I own a book.

I know that nltk functions do not operate on DataFrames directly. How can sent_tokenize be applied to the data above?

When I try:

 df.TEXT.apply(nltk.sent_tokenize) 

the output looks no different from the original frame. My desired result:

 TEXT
 The farmer plants grain.
 The fisher catches tuna.
 The sky is blue.
 The sun is bright.
 I own a phone.
 I own a book.

I would also like to bind this new (desired) DataFrame back to the original ID numbers, like this (after further cleaning of the text):

 ID   TEXT
 265  'farmer', 'plants', 'grain'
 265  'fisher', 'catches', 'tuna'
 456  'sky', 'blue'
 434  'sun', 'bright'
 921  'I', 'own', 'phone'
 921  'I', 'own', 'book'

This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!

2 answers

Edit: as a result of a well-deserved push from @alexis, here is the improved answer.

Tokenizing sentences

This should provide you with a DataFrame with one row for each id and sentence:

 import pandas

 # collect one (ID, sentence) tuple per sentence; in itertuples()
 # rows, row[1] is ID and row[2] is TEXT
 sentences = []
 for row in df.itertuples():
     for sentence in row[2].split('.'):
         if sentence != '':
             sentences.append((row[1], sentence))
 new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

The output is as follows:

[screenshot: a DataFrame with one (ID, SENTENCE) row per sentence, periods removed]

split('.') breaks the strings into sentences quickly, provided the sentences really are separated by periods and periods are not used for anything else (for example, to mark abbreviations), and it removes the periods in the process. It will fail if periods serve more than one purpose and/or not every sentence ends with a period. A slower but much more robust approach is, as you requested, to use sent_tokenize to split the strings into sentences:

 from nltk import sent_tokenize

 sentences = []
 for row in df.itertuples():
     for sentence in sent_tokenize(row[2]):
         sentences.append((row[1], sentence))
 new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

This produces the following output:

[screenshot: a DataFrame with one (ID, SENTENCE) row per sentence; sent_tokenize keeps the trailing periods]
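To see why sent_tokenize is the more reliable choice, compare the two approaches on a string containing an abbreviation (a minimal sketch of mine, not from the original answer; the sent_tokenize output shown assumes NLTK's pretrained English Punkt model):

 from nltk import sent_tokenize

 text = "Dr. Smith plants grain. The fisher catches tuna."

 print(text.split('.'))
 # -> ['Dr', ' Smith plants grain', ' The fisher catches tuna', '']
 # the abbreviation "Dr." is wrongly treated as a sentence boundary

 print(sent_tokenize(text))
 # Punkt's English model typically recognizes "Dr." as an abbreviation:
 # -> ['Dr. Smith plants grain.', 'The fisher catches tuna.']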

If you then want to remove the trailing periods from these sentences, you can do something like:

 new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.')) 

Which will give:

[screenshot: new_df with the added SENTENCE_noperiods column]
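Equivalently, pandas' vectorized string methods do the same without a lambda (a small idiom note of mine):

 # Series.str.strip accepts the characters to strip, just like str.strip
 new_df['SENTENCE_noperiods'] = new_df.SENTENCE.str.strip('.')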

You can also take an apply-based approach (df is your original DataFrame):

 df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES')) 

Yielding:

[screenshot: df with a new SENTENCES column holding each row's list of sentences]

Continuing:

 sentences = df.SENTENCES.apply(pandas.Series)
 sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]

This gives:

[screenshot: columns 'sentence 1' and 'sentence 2', with NaN where a row has only one sentence]

Since our indices have not changed, we can join this back to our original DataFrame:

 df = df.join(sentences) 

[screenshot: the original df with the per-sentence columns joined on]
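As an aside (my addition, not part of the original answer): on pandas 0.25+, DataFrame.explode produces the long one-row-per-sentence format from a list column in a single step:

 from nltk import sent_tokenize

 # explode repeats each row once per list element in SENTENCE
 long_df = (df.assign(SENTENCE=df.TEXT.apply(sent_tokenize))
              .explode('SENTENCE')
              [['ID', 'SENTENCE']]
              .reset_index(drop=True))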

Tokenizing words

Continuing with df from above, we can extract the tokens of the first sentence as follows:

 from nltk import word_tokenize

 df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)

[screenshot: df with a sent_1_words column holding the token list for the first sentence]
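If you want to tokenize every sentence column in one pass, the NaN cells left in later columns for rows with fewer sentences need guarding. A sketch of mine, assuming the 'sentence N' column names produced above:

 from nltk import word_tokenize

 for col in [c for c in df.columns if c.startswith('sentence ')]:
     # rows with fewer sentences hold NaN in the later columns
     df[col + '_words'] = df[col].apply(
         lambda s: word_tokenize(s) if isinstance(s, str) else [])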


This is a bit more involved. First I apply sentence tokenization, then iterate over each sentence, dropping any word that appears in the remove_words list and stripping the punctuation from each remaining word.

 import pandas as pd
 from nltk import sent_tokenize
 from string import punctuation

 remove_words = ['the', 'an', 'a']

 def remove_punctuation(chars):
     return ''.join([c for c in chars if c not in punctuation])

 # example dataframe
 df = pd.DataFrame([[265, "The farmer plants grain. The fisher catches tuna."],
                    [456, "The sky is blue."],
                    [434, "The sun is bright."],
                    [921, "I own a phone. I own a book."]],
                   columns=['sent_id', 'text'])

 df.loc[:, 'text_split'] = df.text.map(sent_tokenize)

 sentences = []
 for _, r in df.iterrows():
     for s in r.text_split:
         filtered_words = [remove_punctuation(w) for w in s.split()
                           if w.lower() not in remove_words]
         # or using nltk.word_tokenize:
         # filtered_words = [w for w in word_tokenize(s)
         #                   if w.lower() not in remove_words and w not in punctuation]
         sentences.append({'sent_id': r.sent_id,
                           'text': s.strip('.'),
                           'words': filtered_words})

 df_words = pd.DataFrame(sentences)

Output

 +-------+--------------------+--------------------+
 |sent_id|                text|               words|
 +-------+--------------------+--------------------+
 |    265|The farmer plants...|[farmer, plants, ...|
 |    265|The fisher catche...|[fisher, catches,...|
 |    456|     The sky is blue|     [sky, is, blue]|
 |    434|   The sun is bright|   [sun, is, bright]|
 |    921|       I own a phone|     [I, own, phone]|
 |    921|        I own a book|      [I, own, book]|
 +-------+--------------------+--------------------+
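One setup note from me: if sent_tokenize or word_tokenize raises a LookupError, NLTK's Punkt tokenizer models need a one-time download first:

 import nltk
 nltk.download('punkt')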

Source: https://habr.com/ru/post/1268212/

