Symmetric word matrix in Python using nltk

I am trying to create a symmetric matrix of words from a text document.

For example: text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

I tokenized the text document into sentences using nltk. Now I want to count, for each word, how many times every other word appears with it in the same sentence. From the text above, I want to produce the matrix below:

             Barbara  good  friends  Benny  bad
    Barbara     2      1       1       1     0
    good        1      1       0       0     0
    friends     1      0       1       1     0
    Benny       1      0       1       2     1
    bad         0      0       1       1     1

Note that the diagonal holds each word's frequency, since Barbara appears together with Barbara in a sentence exactly as many times as Barbara appears at all. I would prefer not to have to recount anything, but this is not a big problem if avoiding that would make the code too complicated.

2 answers

First, we tokenize the text, iterate through each sentence, iterate through all pairwise combinations of words within each sentence, and store the counts in a nested dict:

    from nltk.tokenize import word_tokenize, sent_tokenize
    from collections import defaultdict
    import numpy as np

    text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

    sparse_matrix = defaultdict(lambda: defaultdict(lambda: 0))

    # Count pairwise co-occurrences of words within each sentence.
    for sent in sent_tokenize(text):
        words = word_tokenize(sent)
        for word1 in words:
            for word2 in words:
                sparse_matrix[word1][word2] += 1

    print(sparse_matrix)

    >> defaultdict(<function <lambda> at 0x7f46bc3587d0>, {
       'good': defaultdict(<function <lambda> at 0x3504320>,
           {'is': 1, 'good': 1, 'Barbara': 1, '.': 1}),
       'friends': defaultdict(<function <lambda> at 0x3504410>,
           {'friends': 1, 'is': 1, 'Benny': 1, '.': 1, 'Barbara': 1, 'with': 1}),
       etc..
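As a quick aside (this snippet is mine, not part of the original answer), here is how the nested defaultdict behaves under lookups. One caveat to be aware of: merely reading a missing key from a defaultdict inserts it, so use .get() if you want to probe without growing the structure:

    print(sparse_matrix['good']['Barbara'])      # 1
    print(sparse_matrix['bad']['Barbara'])       # 0 -- but this lookup also
                                                 # inserts 'Barbara' under 'bad'
    print(sparse_matrix['bad'].get('good', 0))   # 0, without inserting anything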

This essentially acts like a matrix in that we can index sparse_matrix['good']['Barbara'] and get the number 1, or index sparse_matrix['bad']['Barbara'] and get 0; we never actually store counts for word pairs that have never co-occurred, and the 0 is generated by the defaultdict only when you request it. This can save a lot of memory. If we need a dense matrix for some kind of linear algebra or other computational reason, we can get it like this:

    lexicon_size = len(sparse_matrix)

    def mod_hash(x, m):
        return hash(x) % m

    dense_matrix = np.zeros((lexicon_size, lexicon_size))

    # Note: hashing words into row/column slots is a quick hack; distinct
    # words can collide, and Python 3 randomizes string hashes between runs,
    # so the exact layout below will vary.
    for k in sparse_matrix:
        for k2 in sparse_matrix[k]:
            dense_matrix[mod_hash(k, lexicon_size)][mod_hash(k2, lexicon_size)] = \
                sparse_matrix[k][k2]

    print(dense_matrix)

    >> [[ 0.  0.  0.  0.  0.  0.  0.  0.]
        [ 0.  0.  0.  0.  0.  0.  0.  0.]
        [ 0.  0.  1.  1.  1.  1.  0.  1.]
        [ 0.  0.  1.  1.  1.  0.  0.  1.]
        [ 0.  0.  1.  1.  1.  1.  0.  1.]
        [ 0.  0.  1.  0.  1.  2.  0.  2.]
        [ 0.  0.  0.  0.  0.  0.  0.  0.]
        [ 0.  0.  1.  1.  1.  2.  0.  3.]]

I would recommend looking at http://docs.scipy.org/doc/scipy/reference/sparse.html for other ways of dealing with matrix sparsity.
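For instance, here is a minimal sketch (my addition, assuming scipy is installed) of building the same co-occurrence counts directly as a scipy.sparse matrix. coo_matrix sums duplicate (row, column) entries on conversion, which does the counting for us:

    from nltk.tokenize import word_tokenize, sent_tokenize
    from scipy.sparse import coo_matrix

    text = "Barbara is good. Barbara is friends with Benny. Benny is bad."
    sents = [word_tokenize(s) for s in sent_tokenize(text)]

    # Deterministic word -> index mapping over the whole vocabulary.
    vocab = {w: i for i, w in enumerate(sorted({w for s in sents for w in s}))}

    rows, cols, data = [], [], []
    for words in sents:
        for w1 in words:
            for w2 in words:
                rows.append(vocab[w1])
                cols.append(vocab[w2])
                data.append(1)

    # Duplicate (row, col) entries are summed when the COO matrix is converted.
    counts = coo_matrix((data, (rows, cols)),
                        shape=(len(vocab), len(vocab))).tocsr()
    print(counts[vocab['Benny'], vocab['bad']])   # 1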


I would first set up something like the following. You would probably want to add in some kind of tokenization, although none was needed for your example.

 text = """Barbara is good. Barbara is friends with Benny. Benny is bad.""" allwords = text.replace('.','').split(' ') word_to_index = {} index_to_word = {} index = 0 for word in allwords: if word not in word_to_index: word_to_index[word] = index index_to_word[index] = word index += 1 word_count = index >>> index_to_word {0: 'Barbara', 1: 'is', 2: 'good', 3: 'friends', 4: 'with', 5: 'Benny', 6: 'bad'} >>> word_to_index {'Barbara': 0, 'Benny': 5, 'bad': 6, 'friends': 3, 'good': 2, 'is': 1, 'with': 4} 

Then declare a matrix of the required size (word_count x word_count), possibly using numpy:

    import numpy
    matrix = numpy.zeros((word_count, word_count))

or just a nested list:

    matrix = [None,] * word_count
    for i in range(word_count):
        matrix[i] = [0,] * word_count

Note that this part is tricky: something like matrix = [[0]*word_count]*word_count will not work, as it creates a list holding 7 references to the same inner list (for example, if you try that code and then do matrix[0][1] = 1, you will find that matrix[1][1], matrix[2][1], etc. have also changed to 1). See the demonstration below.
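Here is a quick demonstration of that pitfall (my addition), along with the usual list-comprehension fix:

    bad = [[0] * 3] * 3                  # three references to the SAME inner list
    bad[0][1] = 1
    print(bad)                           # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

    good = [[0] * 3 for _ in range(3)]   # three independent inner lists
    good[0][1] = 1
    print(good)                          # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]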

Then you just need to iterate through your sentences.

    sentences = text.split('.')
    for sent in sentences:
        for word1 in sent.split(' '):
            if word1 not in word_to_index:
                continue
            for word2 in sent.split(' '):
                if word2 not in word_to_index:
                    continue
                matrix[word_to_index[word1]][word_to_index[word2]] += 1

Then you will get:

 >>> matrix [[2, 2, 1, 1, 1, 1, 0], [2, 3, 1, 1, 1, 2, 1], [1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 1, 1, 1, 0], [1, 1, 0, 1, 1, 1, 0], [1, 2, 0, 1, 1, 2, 1], [0, 1, 0, 0, 0, 1, 1]] 

Or, if you were curious about, say, how often "Benny" and "bad" occur together, you could ask for matrix[word_to_index['Benny']][word_to_index['bad']].
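If you want output labeled like the table in the question, a small helper along these lines (hypothetical, not part of the original answer) prints rows and columns for a chosen subset of words:

    # Print the co-occurrence counts for selected words, with labels.
    words = ['Barbara', 'good', 'friends', 'Benny', 'bad']
    print('\t' + '\t'.join(words))
    for w1 in words:
        row = [str(matrix[word_to_index[w1]][word_to_index[w2]]) for w2 in words]
        print(w1 + '\t' + '\t'.join(row))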
