I am trying to convert the Python implementation of Viterbi's algorithm found in this answer to stack overflow in Ruby. The full script can be found at the end of this question with my comments.
Unfortunately, I know very little about Python, so the translation is more complicated than we would like. However, I have made some progress. Right now, the only line that completely melts in my brain is this:
prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
Can someone explain what he is doing?
Here is the full Python script:
import re
from itertools import groupby
def viterbi_segment(text):
probs, lasts = [1.0], [0]
for i in range(1, len(text) + 1):
prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
probs.append(prob_k)
lasts.append(k)
words = []
i = len(text)
while 0 < i:
words.append(text[lasts[i]:i])
i = lasts[i]
words.reverse()
return words, probs[-1]
def word_prob(word):
return dictionary.get(word, 0) / total
def words(text):
return re.findall('[a-z]+', text.lower())
dictionary = dict((w, len(list(ws)))
for w, ws in groupby(sorted(words(open('/usr/share/dict/words').read()))))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
compounds = words(open('compounds.txt').read())
for comp in compounds:
print comp, ": ", viterbi_segment(comp)
source
share