Here is a solution to your problem using networkX.
This code will generate a directed graph, dG. dG will have one node per word, with a count attribute indicating how many times this word has been seen. It will have one directed front per one bigram, with the attribute "weight" indicating how many times the bigram has occurred.
import networkx as nx import string from sys import maxint str1 = """Man who run in front of car, get tired. Man who run behind car, get exhausted.""" wordList1 = str1.split(None) wordList2 = [string.rstrip(x.lower(), ',.!?;') for x in wordList1] dG = nx.DiGraph() for i, word in enumerate(wordList2): try: next_word = wordList2[i + 1] if not dG.has_node(word): dG.add_node(word) dG.node[word]['count'] = 1 else: dG.node[word]['count'] += 1 if not dG.has_node(next_word): dG.add_node(next_word) dG.node[next_word]['count'] = 0 if not dG.has_edge(word, next_word): dG.add_edge(word, next_word, weight=maxint - 1) else: dG.edge[word][next_word]['weight'] -= 1 except IndexError: if not dG.has_node(word): dG.add_node(word) dG.node[word]['count'] = 1 else: dG.node[word]['count'] += 1 except: raise
To view the contents of a graph, you can print nodes and edges.
for node in dG.nodes(): print '%s:%d\n' % (node, dG.node[node]['count']) for edge in dG.edges(): print '%s:%d\n' % (edge, maxint - dG.edge[edge[0]][edge[1]]['weight'])
Note that the weight of the bigram edge, instead of counting from zero, is counted from maxint. This is because the networkX shortest path utilities will use the weight value as the cost for each edge. By making our highest estimates the smallest weights, we can use these utilities to find the path with the highest edge values.
for example, we can get the path with the highest counter between the word "man" and the word "exhausted":
>>>shortest_path = nx.shortest_path(dG, source='man', target='exhausted', weight='weight') >>>print shortest_path ['man', 'who', 'run', 'behind', 'car', 'get', 'exhausted']
or we can get the paths with the highest number of counts between the word "man" and all other words:
>>>shortest_paths = nx.shortest_path(dG, source='man', weight='weight') >>>print shortest_paths {'run': ['man', 'who', 'run'], 'get': ['man', 'who', 'run', 'behind', 'car', 'get'], 'exhausted': ['man', 'who', 'run', 'behind', 'car', 'get', 'exhausted'], 'car': ['man', 'who', 'run', 'behind', 'car'], 'who': ['man', 'who'], 'behind': ['man', 'who', 'run', 'behind'], 'of': ['man', 'who', 'run', 'in', 'front', 'of'], 'in': ['man', 'who', 'run', 'in'], 'front': ['man', 'who', 'run', 'in', 'front'], 'tired': ['man', 'who', 'run', 'behind', 'car', 'get', 'tired'], 'man': ['man']}
As noted above, there is the possibility of getting cycles on a graph like this, and I'm not sure how well the network shortest path algorithm will handle this case.
Good luck