I have a dataframe with two columns, ID and TEXT. The data looks like this:
ID  TEXT
1   The farmer plants grain. The fisher catches tuna.
2   The sky is blue.
2   The sun is bright.
3   I own a phone. I own a book.
I am cleaning the TEXT column with nltk, so I first need to convert the column to a list:
corpus = df['TEXT'].tolist()
After cleaning (tokenization, deleting special characters and removing stop words), the output is a nested “list of lists”, with one list of tokens per sentence, grouped by original row. It looks like this:
[[['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']], [['sky', 'blue']], [['sun', 'bright']], [['I', 'own', 'phone'], ['I', 'own', 'book']]]
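To make the shape of that output concrete, here is a minimal sketch of a cleaning step that produces it. It assumes a regex-based stand-in for the nltk calls: the `clean` helper and the three-word `STOP_WORDS` set are hypothetical and chosen only so the result matches the list above. They are not the actual cleaning code.

```python
import re

# Hypothetical stand-in for an nltk stop-word list (assumption, not the real list).
STOP_WORDS = {"the", "is", "a"}

def clean(text):
    """Return one list of tokens per sentence in `text`."""
    # Naive sentence split on '.', '!' or '?'; a real tokenizer would be more robust.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ keeps only word characters, which drops punctuation and symbols.
    return [
        [tok for tok in re.findall(r"\w+", s) if tok.lower() not in STOP_WORDS]
        for s in sentences
    ]

corpus = [
    "The farmer plants grain. The fisher catches tuna.",
    "The sky is blue.",
    "The sun is bright.",
    "I own a phone. I own a book.",
]

# One entry per dataframe row; each entry is a list of per-sentence token lists.
cleaned = [clean(text) for text in corpus]
```

The key point is that `cleaned` has exactly one entry per original row, in the original order, so it can still be lined up with the ID column by position.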
I know how to turn a flat list back into a pandas dataframe, but how do I get the list of lists back into a dataframe so that each piece of text is still matched to its ID? My desired result:
ID  TEXT
1   'farmer', 'plants', 'grain'
1   'fisher', 'catches', 'tuna'
2   'sky', 'blue'
2   'sun', 'bright'
3   'I', 'own', 'phone'
3   'I', 'own', 'book'
I assume this is a simple matter of converting between Python data structures, but I'm not sure where to start. A specific work product is less important here than the general concept: dataframe -> native Python data structure -> do something to the native Python data structure -> dataframe with the source attributes intact.
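For illustration, one possible shape for that round trip is sketched below. It assumes the cleaned list keeps one entry per dataframe row in the original order, so IDs and entries can be paired by position. The names `cleaned`, `rows` and `result` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "TEXT": [
        "The farmer plants grain. The fisher catches tuna.",
        "The sky is blue.",
        "The sun is bright.",
        "I own a phone. I own a book.",
    ],
})

# The nested list from the question: one entry per row of df, in the same order.
cleaned = [
    [["farmer", "plants", "grain"], ["fisher", "catches", "tuna"]],
    [["sky", "blue"]],
    [["sun", "bright"]],
    [["I", "own", "phone"], ["I", "own", "book"]],
]

# Pair each ID with its row's sentences by position, then emit one
# (ID, tokens) record per sentence so the ID is repeated as needed.
rows = [
    (row_id, tokens)
    for row_id, sentences in zip(df["ID"], cleaned)
    for tokens in sentences
]

# Build the new dataframe; each TEXT cell holds one list of tokens.
result = pd.DataFrame(rows, columns=["ID", "TEXT"])
print(result)
```

The pairing relies only on position, so it breaks if rows are dropped or reordered during cleaning. An alternative along the same lines, assuming pandas 0.25 or later, would be to assign the nested list back as a column and call `DataFrame.explode` on it.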
Any insight you can all provide is greatly appreciated! Please let me know if I can improve my question at all!