Converting a dataframe column to a list of lists and back to a dataframe while preserving the ID association

I have a dataframe with two columns, ID and TEXT. This is the data:

 ID  TEXT
 1   The farmer plants grain. The fisher catches tuna.
 2   The sky is blue.
 2   The sun is bright.
 3   I own a phone. I own a book.

I am cleaning up the TEXT column with nltk, so I need to convert the column to a list:

 corpus = df['TEXT'].tolist() 

After cleaning (tokenization, removing special characters, and removing stop words), the output is a list of lists and looks like this:

 [[['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']],
  [['sky', 'blue']],
  [['sun', 'bright']],
  [['I', 'own', 'phone'], ['I', 'own', 'book']]]
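For context, a cleanup along these lines might look like the sketch below. The regex tokenizer and the tiny hard-coded stop-word set are stand-ins for the nltk calls (`nltk.sent_tokenize`, `nltk.word_tokenize`, `nltk.corpus.stopwords`), chosen so the example runs without downloading nltk data; the function name `clean` is made up for illustration:

```python
import re

# Tiny illustrative stop-word set; nltk.corpus.stopwords gives a real one.
STOP_WORDS = {'the', 'is', 'a'}

def clean(text):
    """Split one row of text into sentences, then into cleaned word lists."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    out = []
    for sent in sentences:
        words = re.findall(r"[A-Za-z']+", sent)           # drops punctuation
        out.append([w for w in words if w.lower() not in STOP_WORDS])
    return out

corpus = ['The farmer plants grain. The fisher catches tuna.',
          'The sky is blue.']
print([clean(t) for t in corpus])
# [[['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']], [['sky', 'blue']]]
```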

I know how to get a flat list back into a pandas dataframe, but how do I get the list of lists back into a dataframe with an ID column that is still associated with each piece of text? My desired result:

 ID  TEXT
 1   'farmer', 'plants', 'grain'
 1   'fisher', 'catches', 'tuna'
 2   'sky', 'blue'
 2   'sun', 'bright'
 3   'I', 'own', 'phone'
 3   'I', 'own', 'book'

I assume this comes down to a simple conversion between Python data structures, but I'm not sure where to start. The specific code matters less here than the concept: dataframe -> native Python data structure -> something done to that data structure -> dataframe with the source attributes intact.

Any insight you can all provide is greatly appreciated! Please let me know if I can improve my question at all!

1 answer

Pandas dataframes offer a lot of convenient whole-table operations, but it's often easier to manipulate your data when it's not stuffed in a dataframe, especially if you're just getting started. That's certainly what I recommend if you're working with nltk. To keep the text and the IDs together, convert your dataframe to a list of tuples. If your dataframe really has only two columns of interest, you can do it like this:

 >>> data = list(zip(df["ID"], df["TEXT"]))
 >>> from pprint import pprint
 >>> pprint(data)
 [(265, 'The farmer plants grain. The fisher catches tuna.'),
  (456, 'The sky is blue.'),
  (434, 'The sun is bright.'),
  (921, 'I own a phone. I own a book.')]

Now, if you want to work with the sentences without losing the IDs, use a two-variable loop like this (this creates the extra rows you asked for):

 sent_data = []
 for id, text in data:
     for sent in nltk.sent_tokenize(text):
         sent_data.append((id, sent))
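Since `nltk.sent_tokenize` needs the punkt data to be downloaded, here is a self-contained sketch of the same loop with a naive regex splitter standing in for the nltk call; the data values are illustrative:

```python
import re

def sent_tokenize(text):
    # Stand-in for nltk.sent_tokenize: split after sentence-ending punctuation.
    return re.split(r'(?<=[.!?])\s+', text.strip())

data = [(1, 'The farmer plants grain. The fisher catches tuna.'),
        (2, 'The sky is blue.')]

# One (id, sentence) tuple per sentence: the ID repeats for multi-sentence rows.
sent_data = []
for id, text in data:
    for sent in sent_tokenize(text):
        sent_data.append((id, sent))

print(sent_data)
# [(1, 'The farmer plants grain.'), (1, 'The fisher catches tuna.'),
#  (2, 'The sky is blue.')]
```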

What comes next depends on your application, but you'll most likely build a new list of two-element tuples. If you're just applying a transformation, use a list comprehension. For example:

 >>> datawords = [(id, nltk.word_tokenize(t)) for id, t in data]
 >>> print(datawords[3])
 (921, ['I', 'own', 'a', 'phone', '.', 'I', 'own', 'a', 'book', '.'])
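If you also want to drop stop words and punctuation at this stage, the same comprehension can carry a filter. A sketch, using a small made-up stop set in place of `nltk.corpus.stopwords` and token lists like the ones `word_tokenize` produces:

```python
# Illustrative stop set; in practice you'd build this from nltk's stopwords.
STOP = {'a', 'the', '.'}

data = [(921, ['I', 'own', 'a', 'phone', '.', 'I', 'own', 'a', 'book', '.'])]

# Keep the ID, filter the token list in place of a plain transform.
filtered = [(id, [w for w in words if w.lower() not in STOP])
            for id, words in data]
print(filtered)
# [(921, ['I', 'own', 'phone', 'I', 'own', 'book'])]
```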

Turning the list of tuples back into a dataframe is as simple as it gets:

  newdf = pd.DataFrame(datawords, columns=["INDEX", "WORDS"]) 
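For completeness, here is one possible end-to-end version of the round trip. It's a sketch, not a drop-in solution: it assumes pandas is importable, uses made-up ID/TEXT values, and substitutes a naive `'. '` split for `nltk.sent_tokenize`:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 3],
                   'TEXT': ['The sky is blue.', 'I own a phone. I own a book.']})

# dataframe -> list of (id, text) tuples
data = list(zip(df['ID'], df['TEXT']))

# One (id, word_list) tuple per sentence; split('. ') stands in for nltk here.
sent_data = [(id, sent.rstrip('.').split())
             for id, text in data
             for sent in text.split('. ')]

# list of tuples -> dataframe, with the IDs still attached to each row
newdf = pd.DataFrame(sent_data, columns=['ID', 'WORDS'])
print(newdf)
#    ID                 WORDS
# 0   1  [The, sky, is, blue]
# 1   3     [I, own, a, phone]
# 2   3      [I, own, a, book]
```

Because each tuple carries its own ID, rows that came from multi-sentence texts simply repeat the ID, which is exactly the shape the question asks for.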

Source: https://habr.com/ru/post/1268214/

