I was reading Foundations of Statistical Natural Language Processing, which makes the following statement about the relationship between information entropy and language models:
... The essential point here is that if a model captures more of the structure of a language, then the entropy of the model should be lower. In other words, we can use entropy as a measure of the quality of our models ...
But what about this example:
Suppose we have a machine that emits two characters, A and B, one after the other, and the designer of the machine made A and B equally probable.
I am not the designer. I am only trying to model the machine by observing its output.
In an initial experiment, I see the machine produce the following sequence of characters:
A, B, A
So I model the machine as P(A) = 2/3 and P(B) = 1/3, and we can calculate the entropy of this model as:
-2/3 * log(2/3) - 1/3 * log(1/3) = 0.918 bits (log base 2)
But then the designer tells me about his design, so I refine my model with this additional information. The new model looks like this:
P(A) = 1/2, P(B) = 1/2
And the entropy of this new model:
-1/2 * log(1/2) - 1/2 * log(1/2) = 1 bit
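Here is a minimal Python check of the two entropy figures above (the helper function entropy is my own name, not something from the book):

```python
# Verify the two entropy calculations from the question (log base 2).
from math import log2

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * log2(p) for p in probs)

print(entropy([2/3, 1/3]))  # ~0.918 bits: the empirical model fit to "A, B, A"
print(entropy([1/2, 1/2]))  # 1.0 bits: the designer's true model
```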
The second model is obviously better than the first, yet its entropy has increased.
My point is that, because the model being evaluated is arbitrary, we cannot blindly say that lower entropy indicates a better model.
Can anyone shed some light on this?
ADD 1
(Many thanks to Rob Neuhaus!)
Yes, after re-reading the relevant part of the NLP book, I think I can now explain it.
The quantity I computed above is the entropy of the model's own distribution, and that is not what should be used to judge the model. What matters is how much surprise the model shows when confronted with the machine's real output: for each character the machine actually produces, the model assigns it some probability p, and the cost of encoding that character is -log(p) bits. We then average this cost over a long real sequence. Suppose the machine produces a 1000-character sequence containing 500 A's and 500 B's.
Under the 1/3-2/3 model, the average cost per character is:
[-500 * log(1/3) - 500 * log(2/3)] / 1000 = 1/2 * log(9/2) ≈ 1.085 bits
Under the 1/2-1/2 model it is:
[-500 * log(1/2) - 500 * log(1/2)] / 1000 = 1/2 * log(8/2) = 1 bit
So the 1/3-2/3 model assigns more surprise to the real data, i.e. it needs more bits per character to encode it, which means it is the worse model.
The correct 1/2-1/2 model achieves the lower average cost, so it is the better one.
In other words, the number to compare across models is this average surprise on real data (the cross-entropy), not the entropy of the model's own distribution.
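To make that comparison concrete, here is a small Python sketch of the same average-surprise calculation; the 1000-character sample with 500 A's and 500 B's is the hypothetical one used above, and avg_surprise is just my own name for the per-character cross-entropy of a model against observed counts:

```python
# Score each model by the average -log2(p) it assigns to the observed data
# (cross-entropy against the data), rather than by its own entropy.
from math import log2

def avg_surprise(model, counts):
    """Average -log2(model[c]) per character, weighted by observed counts."""
    total = sum(counts.values())
    return sum(n * -log2(model[c]) for c, n in counts.items()) / total

data = {"A": 500, "B": 500}  # hypothetical 1000-character sample

print(avg_surprise({"A": 2/3, "B": 1/3}, data))  # ~1.085 bits = 1/2 * log2(9/2)
print(avg_surprise({"A": 1/2, "B": 1/2}, data))  # 1.0 bits    = 1/2 * log2(4)
```

The model with the lower score compresses the real output better, which is the sense in which lower (cross-)entropy means a better model.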