Lemmatization of all pandas cells

I have a pandas DataFrame. There is one column, call it 'col'. Each entry in this column is a list of words: ['word1', 'word2', ...].

How can I efficiently calculate the lemma of all these words using the nltk library?

import nltk
nltk.stem.WordNetLemmatizer().lemmatize('word')

I want to find the lemma of every word in every cell of one column of the pandas DataFrame.

My data looks something like this:

import pandas as pd
data = [[['walked','am','stressed','Fruit']],[['going','gone','walking','riding','running']]]
df = pd.DataFrame(data,columns=['col'])
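For illustration, here is a minimal sketch of the desired transformation. It uses a hypothetical stand-in lemma table instead of NLTK's WordNetLemmatizer, so it runs without downloading the WordNet corpus; in real use you would replace the dictionary lookup with `lemmatizer.lemmatize(w)`:

```python
import pandas as pd

# Hypothetical stand-in lemma table, so this sketch runs without
# NLTK's WordNet data; swap in WordNetLemmatizer().lemmatize for real use.
LEMMAS = {'walked': 'walk', 'going': 'go', 'gone': 'go',
          'walking': 'walk', 'riding': 'ride', 'running': 'run'}

def lemmatize_list(words):
    """Lemmatize every word in a list, leaving unknown words unchanged."""
    return [LEMMAS.get(w, w) for w in words]

data = [[['walked', 'am', 'stressed', 'Fruit']],
        [['going', 'gone', 'walking', 'riding', 'running']]]
df = pd.DataFrame(data, columns=['col'])

# apply calls the function once per cell; each cell is a list of words
df['col_lemma'] = df['col'].apply(lemmatize_list)
print(df['col_lemma'].tolist())
# [['walk', 'am', 'stressed', 'Fruit'], ['go', 'go', 'walk', 'ride', 'run']]
```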
2 answers

You can use apply from pandas with a function to lemmatize every word in a given row. Note that there are many ways to tokenize text; you may need to strip characters such as `.` if you are using a whitespace tokenizer.


import nltk
import pandas as pd

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'], columns=['text'])
df['text_lemmatized'] = df.text.apply(lemmatize_text)
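To illustrate the punctuation caveat above without needing NLTK installed, plain `str.split()` behaves like a whitespace tokenizer: punctuation stays glued to the word, so `'cheesy.'` would be handed to the lemmatizer as-is.

```python
# Whitespace tokenization keeps punctuation attached to words;
# str.split() shows the same behaviour as nltk's WhitespaceTokenizer.
tokens = 'this was cheesy.'.split()
print(tokens)  # ['this', 'was', 'cheesy.'] -- note the trailing period
```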
If your column looks like this:

|col|
['Sushi Bars', 'Restaurants']
['Burgers', 'Fast Food', 'Restaurants']

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()


def lemmatize(s):
    '''Lemmatize every word in the list s.'''
    return [wnl.lemmatize(word) for word in s]

dataset.loc[:,"col_lemma"] = dataset.col.apply(lambda x: lemmatize(x))
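A quick note on the lambda: `apply` already passes each cell to the function, so `dataset.col.apply(lemmatize)` is equivalent. A runnable sketch with a toy per-list transform (a stand-in so it doesn't need NLTK data):

```python
import pandas as pd

def upper_all(words):
    # Toy transform standing in for the answer's lemmatize();
    # apply passes each cell (a list) directly, so no lambda is needed.
    return [w.upper() for w in words]

dataset = pd.DataFrame({'col': [['sushi', 'bars'], ['burgers', 'fast', 'food']]})
dataset.loc[:, 'col_upper'] = dataset.col.apply(upper_all)  # same as lambda x: upper_all(x)
print(dataset['col_upper'].tolist())
# [['SUSHI', 'BARS'], ['BURGERS', 'FAST', 'FOOD']]
```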

Source: https://habr.com/ru/post/1690027/
