The project you are referencing uses sequence_to_sequence_loss_by_example, which returns the cross-entropy loss. So, to calculate perplexity during training, you just need to exponentiate the loss, as described here:
train_perplexity = tf.exp(train_loss)
We should use e rather than 2 as the base, because TensorFlow measures the cross-entropy loss with the natural logarithm (TF Documentation). Thanks to @Matthias Arro and @Colin Skow for the tip.
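As a concrete illustration, here is a minimal sketch of that pattern in plain TensorFlow (not the referenced project's code; the tensor names, shapes, and the use of sparse_softmax_cross_entropy_with_logits are assumptions for the example): the per-token cross-entropy is averaged and then exponentiated with base e.

import tensorflow as tf

# Hypothetical per-token logits and integer targets; shapes chosen just for the example.
logits = tf.random.normal([32, 20, 1000])                           # [batch, time, vocab]
targets = tf.random.uniform([32, 20], maxval=1000, dtype=tf.int32)  # token ids

# Per-token cross-entropy, measured in nats (natural logarithm), as TensorFlow does.
per_token_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=targets, logits=logits)

# Average the loss over all tokens, then exponentiate to get perplexity.
train_loss = tf.reduce_mean(per_token_loss)
train_perplexity = tf.exp(train_loss)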
Detailed explanation
The cross-entropy of two probability distributions P and Q tells us the minimum average number of bits we need to encode events of P when we design a coding scheme based on Q. So P is the true distribution, which we usually don't know. We want to find a Q as close to P as possible, so that we can develop a nice coding scheme with as few bits per event as possible.
I shouldn't really say bits, because we can only use bits as a measure if we use base 2 when calculating the cross-entropy. But TensorFlow uses the natural logarithm, so let's measure the cross-entropy in nats instead.
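To make the bits-versus-nats point concrete, here is a small sketch (the two three-symbol distributions are made up purely for illustration): the same cross-entropy H(P, Q) = -sum over x of P(x) * log Q(x) is computed once with the natural logarithm (nats) and once with base 2 (bits); the two results differ only by a factor of log(2).

import numpy as np

# Hypothetical true distribution P and model distribution Q over 3 symbols.
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

# Cross-entropy H(P, Q) = -sum_x P(x) * log Q(x)
h_nats = -np.sum(P * np.log(Q))    # natural log -> nats (what TensorFlow uses)
h_bits = -np.sum(P * np.log2(Q))   # base-2 log  -> bits

print(h_nats, h_bits, h_bits * np.log(2))  # the last value equals h_nats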
So, let's say we have a bad language model that says every token (character / word) in the vocabulary is equally likely to be the next one. For a vocabulary of 1000 tokens, this model will have a cross-entropy of log(1000) = 6.9 nats. When predicting the next token, it has to choose uniformly between 1000 tokens at each step.
A better language model will determine a probability distribution Q that is closer to P. Thus, its cross-entropy is lower; say we get a cross-entropy of 3.9 nats. If we now want to measure the perplexity, we simply exponentiate the cross-entropy:
exp(3.9) = 49.4
So, on the samples for which we calculated the loss, the good model was as perplexed as if it had to choose uniformly and independently among about 50 tokens.
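A quick numeric check of the figures above (the 3.9 nats value is just the illustrative number from the example, not a measured result):

import math

vocab_size = 1000

# Uniform model: cross-entropy is log(vocab_size) nats, and its perplexity is the vocabulary size itself.
uniform_ce = math.log(vocab_size)
print(uniform_ce, math.exp(uniform_ce))    # 6.907..., 1000.0

# Better model from the example: 3.9 nats of cross-entropy.
print(math.exp(3.9))                       # 49.40... -> as perplexed as choosing among about 50 tokens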