Run nltk sent_tokenize via Pandas dataframe

I have a DataFrame with two columns, ID and TEXT. Here are the data:

 ID   TEXT
 265  The farmer plants grain. The fisher catches tuna.
 456  The sky is blue.
 434  The sun is bright.
 921  I own a phone. I own a book.

I know that nltk functions do not operate on DataFrames directly. How can sent_tokenize be applied to the data above?

When I try:

 df.TEXT.apply(nltk.sent_tokenize) 

the output looks no different from the original frame. My desired result:

 TEXT
 The farmer plants grain.
 The fisher catches tuna.
 The sky is blue.
 The sun is bright.
 I own a phone.
 I own a book.

I would also like to bind this new (desired) DataFrame back to the original ID numbers, like this (after further cleaning of the text):

 ID   TEXT
 265  'farmer', 'plants', 'grain'
 265  'fisher', 'catches', 'tuna'
 456  'sky', 'blue'
 434  'sun', 'bright'
 921  'I', 'own', 'phone'
 921  'I', 'own', 'book'

This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!

2 answers

Edit: as a result of a well-deserved push from @alexis, here is the improved answer.

Tokenizing sentences

This should provide you with a DataFrame with one row for each id and sentence:

 import pandas

 # collect one (ID, sentence) tuple per sentence; in itertuples()
 # rows, row[1] is ID and row[2] is TEXT
 sentences = []
 for row in df.itertuples():
     for sentence in row[2].split('.'):
         if sentence != '':
             sentences.append((row[1], sentence))
 new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

The output is as follows:

[screenshot: a DataFrame with one (ID, SENTENCE) row per sentence, periods removed]

split('.') breaks the strings into sentences quickly, provided the sentences really are separated by periods and periods are not used for anything else (for example, to mark abbreviations), and it removes the periods in the process. It will fail if periods serve more than one purpose and/or not every sentence ends with a period. A slower but much more robust approach is, as you requested, to use sent_tokenize to split the strings into sentences:

 from nltk import sent_tokenize

 sentences = []
 for row in df.itertuples():
     for sentence in sent_tokenize(row[2]):
         sentences.append((row[1], sentence))
 new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])

This produces the following output:

[screenshot: a DataFrame with one (ID, SENTENCE) row per sentence; sent_tokenize keeps the trailing periods]
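To see why sent_tokenize is the more reliable choice, compare the two approaches on a string containing an abbreviation (a minimal sketch of mine, not from the original answer; the sent_tokenize output shown assumes NLTK's pretrained English Punkt model):

 from nltk import sent_tokenize

 text = "Dr. Smith plants grain. The fisher catches tuna."

 print(text.split('.'))
 # -> ['Dr', ' Smith plants grain', ' The fisher catches tuna', '']
 # the abbreviation "Dr." is wrongly treated as a sentence boundary

 print(sent_tokenize(text))
 # Punkt's English model typically recognizes "Dr." as an abbreviation:
 # -> ['Dr. Smith plants grain.', 'The fisher catches tuna.']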

If you then want to remove the trailing periods from these sentences, you can do something like:

 new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.')) 

Which will give:

[screenshot: new_df with the added SENTENCE_noperiods column]
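Equivalently, pandas' vectorized string methods do the same without a lambda (a small idiom note of mine):

 # Series.str.strip accepts the characters to strip, just like str.strip
 new_df['SENTENCE_noperiods'] = new_df.SENTENCE.str.strip('.')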

You can also take an apply-based approach (df is your original DataFrame):

 df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES')) 

Yielding:

[screenshot: df with a new SENTENCES column holding each row's list of sentences]

Continuing:

 sentences = df.SENTENCES.apply(pandas.Series)
 sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]

This gives:

[screenshot: columns 'sentence 1' and 'sentence 2', with NaN where a row has only one sentence]

Since our indices have not changed, we can join this back to our original DataFrame:

 df = df.join(sentences) 

[screenshot: the original df with the per-sentence columns joined on]
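As an aside (my addition, not part of the original answer): on pandas 0.25+, DataFrame.explode produces the long one-row-per-sentence format from a list column in a single step:

 from nltk import sent_tokenize

 # explode repeats each row once per list element in SENTENCE
 long_df = (df.assign(SENTENCE=df.TEXT.apply(sent_tokenize))
              .explode('SENTENCE')
              [['ID', 'SENTENCE']]
              .reset_index(drop=True))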

Tokenizing words

Continuing with df from above, we can extract the tokens of the first sentence as follows:

 from nltk import word_tokenize

 df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)

[screenshot: df with a sent_1_words column holding the token list for the first sentence]
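If you want to tokenize every sentence column in one pass, the NaN cells left in later columns for rows with fewer sentences need guarding. A sketch of mine, assuming the 'sentence N' column names produced above:

 from nltk import word_tokenize

 for col in [c for c in df.columns if c.startswith('sentence ')]:
     # rows with fewer sentences hold NaN in the later columns
     df[col + '_words'] = df[col].apply(
         lambda s: word_tokenize(s) if isinstance(s, str) else [])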


This is a bit more involved. First I apply sentence tokenization, then iterate over each sentence, dropping any word that appears in the remove_words list and stripping the punctuation from each remaining word.

 import pandas as pd
 from nltk import sent_tokenize
 from string import punctuation

 remove_words = ['the', 'an', 'a']

 def remove_punctuation(chars):
     return ''.join([c for c in chars if c not in punctuation])

 # example dataframe
 df = pd.DataFrame([[265, "The farmer plants grain. The fisher catches tuna."],
                    [456, "The sky is blue."],
                    [434, "The sun is bright."],
                    [921, "I own a phone. I own a book."]],
                   columns=['sent_id', 'text'])

 df.loc[:, 'text_split'] = df.text.map(sent_tokenize)

 sentences = []
 for _, r in df.iterrows():
     for s in r.text_split:
         filtered_words = [remove_punctuation(w) for w in s.split()
                           if w.lower() not in remove_words]
         # or using nltk.word_tokenize:
         # filtered_words = [w for w in word_tokenize(s)
         #                   if w.lower() not in remove_words and w not in punctuation]
         sentences.append({'sent_id': r.sent_id,
                           'text': s.strip('.'),
                           'words': filtered_words})

 df_words = pd.DataFrame(sentences)

Output

 +-------+--------------------+--------------------+
 |sent_id|                text|               words|
 +-------+--------------------+--------------------+
 |    265|The farmer plants...|[farmer, plants, ...|
 |    265|The fisher catche...|[fisher, catches,...|
 |    456|     The sky is blue|     [sky, is, blue]|
 |    434|   The sun is bright|   [sun, is, bright]|
 |    921|       I own a phone|     [I, own, phone]|
 |    921|        I own a book|      [I, own, book]|
 +-------+--------------------+--------------------+
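One setup note from me: if sent_tokenize or word_tokenize raises a LookupError, NLTK's Punkt tokenizer models need a one-time download first:

 import nltk
 nltk.download('punkt')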

Source: https://habr.com/ru/post/1268212/

