Need help understanding this Python Viterbi algorithm

I am trying to translate the Python implementation of the Viterbi algorithm from this Stack Overflow answer into Ruby. The full script, with my comments, can be found at the end of this question.

Unfortunately, I know very little Python, so the translation is harder than I would like. I have made some progress, though. Right now, the only line that completely melts my brain is this one:

prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))

Can someone explain what it is doing?

Here is the full Python script:

import re
from itertools import groupby

# text will be a compound word such as 'wickedweather'.
def viterbi_segment(text):
  probs, lasts = [1.0], [0]

  # Iterate over the letters in the compound.
  # eg. [w, ickedweather], [wi, ckedweather], and so on.
  for i in range(1, len(text) + 1):
    # I've no idea what this line is doing and I can't figure out how to split it up?
    prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
    # Append values to arrays.
    probs.append(prob_k)
    lasts.append(k)

  words = []
  i = len(text)
  while 0 < i:
    words.append(text[lasts[i]:i])
    i = lasts[i]
  words.reverse()
  return words, probs[-1]

# Calc the probability of a word based on occurrences in the dictionary.
def word_prob(word):
  # dictionary.get(key) will return the value for the specified key.
  # In this case, that's the number of occurrences of the word in the
  # dictionary. The second argument is a default value to return if
  # the word is not found.
  return dictionary.get(word, 0) / total

# This ensures we only deal with full words rather than individual
# letters. Basically, it normalizes the words.
def words(text):
  return re.findall('[a-z]+', text.lower())

# This gets us a hash where the keys are words and the values are the
# number of occurrences in the dictionary.
dictionary = dict((w, len(list(ws)))
  # /usr/share/dict/words is a file of newline-delimited words.
  for w, ws in groupby(sorted(words(open('/usr/share/dict/words').read()))))

# Assign the length of the longest word in the dictionary.
max_word_length = max(map(len, dictionary))

# Assign the total number of words in the dictionary. It's a float
# because we're going to divide by it later on.
total = float(sum(dictionary.values()))

# Run the algo over a file of newline delimited compound words.
compounds = words(open('compounds.txt').read())
for comp in compounds:
  print(comp, ":", viterbi_segment(comp))
1 answer

You are looking at a generator expression (it looks like a list comprehension, but without the square brackets) being passed to max().

The extended version looks something like this:

all_probs = []

for j in range(max(0, i - max_word_length), i):
    all_probs.append((probs[j] * word_prob(text[j:i]), j))

prob_k, k = max(all_probs)
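
The key detail is that each candidate is a (probability, index) tuple, and Python's max() compares tuples element by element, so the pair with the highest probability wins and its index comes along for free. The assignment then unpacks that winning tuple into prob_k and k. A minimal sketch (the candidate values are made up for illustration):

```python
# Each candidate is a (score, start_index) pair; max() compares
# tuples left to right, so the highest score wins and the index
# paired with it is returned too.
candidates = [(0.2, 0), (0.5, 1), (0.1, 2)]
best_score, best_index = max(candidates)
print(best_score, best_index)  # 0.5 1
```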


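To see the whole thing end to end, here is a self-contained run of the script's viterbi_segment with a toy dictionary standing in for the /usr/share/dict/words counts (the words and counts below are made up for illustration):

```python
# Toy word counts standing in for the real dictionary file.
dictionary = {'wicked': 2, 'weather': 3, 'we': 5, 'at': 4, 'her': 2}
total = float(sum(dictionary.values()))
max_word_length = max(map(len, dictionary))

def word_prob(word):
    # Unknown words get probability 0.
    return dictionary.get(word, 0) / total

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        # Best (probability, split point) for the prefix text[:i].
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

words, prob = viterbi_segment('wickedweather')
print(words)  # ['wicked', 'weather']
```

With this dictionary, the split after 'wicked' is the only one whose pieces all have nonzero probability, so that segmentation wins.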

Source: https://habr.com/ru/post/1663169/
