I would first set up something like the following. You may eventually want a more careful tokenization step, although your example doesn't need one.
text = """Barbara is good. Barbara is friends with Benny. Benny is bad.""" allwords = text.replace('.','').split(' ') word_to_index = {} index_to_word = {} index = 0 for word in allwords: if word not in word_to_index: word_to_index[word] = index index_to_word[index] = word index += 1 word_count = index >>> index_to_word {0: 'Barbara', 1: 'is', 2: 'good', 3: 'friends', 4: 'with', 5: 'Benny', 6: 'bad'} >>> word_to_index {'Barbara': 0, 'Benny': 5, 'bad': 6, 'friends': 3, 'good': 2, 'is': 1, 'with': 4}
Then declare a matrix of the required size (word_count x word_count), possibly using numpy:
    import numpy
    matrix = numpy.zeros((word_count, word_count))
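One small caveat if you take the numpy route: numpy.zeros defaults to float64, so the counts will print as floats rather than the integers shown further down. Passing dtype=int keeps them as integers:

    import numpy
    matrix = numpy.zeros((word_count, word_count), dtype=int)  # integer counts instead of 0.0s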
or just a nested list:
    matrix = [None] * word_count
    for i in range(word_count):
        matrix[i] = [0] * word_count
Note that this is more verbose than you might expect: the shortcut matrix = [[0]*word_count]*word_count will not work, because it creates an outer list holding 7 references to the same inner list (for example, if you try that version and then do matrix[0][1] = 1, you will find that matrix[1][1], matrix[2][1], etc. have also changed to 1).
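If the explicit loop feels clunky, a list comprehension is the usual way to build independent rows in one line; the short check below just demonstrates the aliasing problem described above:

    # Safe: each pass through the comprehension builds a new inner list.
    matrix = [[0] * word_count for _ in range(word_count)]

    # Broken: the outer list holds the *same* inner list 7 times.
    shared = [[0] * word_count] * word_count
    shared[0][1] = 1
    print(shared[1][1], shared[2][1])   # prints "1 1" -- every row appears to change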
Then you just need to loop over your sentences.
    sentences = text.split('.')
    for sent in sentences:
        for word1 in sent.split(' '):
            if word1 not in word_to_index:
                continue  # also skips the empty strings left over from splitting
            for word2 in sent.split(' '):
                if word2 not in word_to_index:
                    continue
                matrix[word_to_index[word1]][word_to_index[word2]] += 1
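If you prefer, the same double loop can be written with itertools.product over the words that survive the lookup filter; this is just an equivalent sketch and produces the same counts:

    from itertools import product

    for sent in text.split('.'):
        words = [w for w in sent.split() if w in word_to_index]
        for word1, word2 in product(words, repeat=2):
            matrix[word_to_index[word1]][word_to_index[word2]] += 1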
Either way, you will get (shown here with the nested-list version):
    >>> matrix
    [[2, 2, 1, 1, 1, 1, 0],
     [2, 3, 1, 1, 1, 2, 1],
     [1, 1, 1, 0, 0, 0, 0],
     [1, 1, 0, 1, 1, 1, 0],
     [1, 1, 0, 1, 1, 1, 0],
     [1, 2, 0, 1, 1, 2, 1],
     [0, 1, 0, 0, 0, 1, 1]]
Or, if you are curious how often "Benny" and "bad" occur together, you can look up matrix[word_to_index['Benny']][word_to_index['bad']].
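For this toy text that lookup returns 1, since "Benny" and "bad" only share the last sentence:

    >>> matrix[word_to_index['Benny']][word_to_index['bad']]
    1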