How to estimate the entropy of the English language

How can I estimate the entropy of the English language using the isolated probabilities of its characters?

1 answer

If we define “isolated character probabilities” the way it is done in this SO answer, we need to do the following:

  • Get as representative a sample of English text as possible (for example, a carefully selected mix of news articles, blog posts, scientific papers, and personal letters)

  • Iterate over its characters and count how many times each one occurs

  • Use each character's count divided by the total number of characters as an estimate of its probability

  • Calculate each character's average bit length by multiplying its probability by the negative logarithm of that same probability (a base-2 logarithm if we want the entropy in bits)

  • Sum these average lengths over all characters. That sum is the result (spelled out as a formula right after this list).
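
Put in symbols, the steps above compute the standard isolated-symbol entropy estimate (this is just a restatement of the list, not anything beyond it):

    H = - \sum_i p_i \log_2 p_i,   where   p_i = (count of character i) / (total number of characters)

and each term -p_i \log_2 p_i is the "average bit length" contributed by character i.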

Cautions:

  • This entropy of isolated characters is not what is usually called the Shannon entropy estimate for English. Shannon based his estimate on conditional n-gram probabilities rather than isolated characters, and much of his famous 1951 article is about how to determine a suitable n.

  • Most people who try to estimate the entropy of English exclude punctuation and normalize the whole text to lowercase.

  • The above assumes that the unit is a single character (letter) of English text. You could do the same thing for whole words or other units of text; a minimal word-level sketch follows this list.
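
As a concrete illustration of the last point, here is a minimal word-level sketch (Python 3; the function name word_entropy and the simple whitespace tokenization are my own illustrative choices, not part of the original answer):

from collections import Counter
from math import log2

def word_entropy(text):
    # Estimate entropy in bits per word from isolated word probabilities.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())

For example, word_entropy('the cat sat on the mat') returns roughly 2.25 bits per word.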

Code example:

Here is Python code that implements the procedure described above. It normalizes the text to lowercase and excludes punctuation and any other non-alphabetic, non-whitespace characters. It assumes that you have assembled a representative English corpus and supply it (ASCII-encoded) on STDIN.

import re
import sys
from math import log

# Function to compute the base-2 logarithm of a floating point number.
def log2(number):
    return log(number) / log(2)

# Function to normalise the text: replace any run of characters that is
# not a lowercase letter with a single space.
cleaner = re.compile('[^a-z]+')
def clean(text):
    return cleaner.sub(' ', text)

# Dictionary for letter counts
letter_frequency = {}

# Read and normalise the input text
text = clean(sys.stdin.read().lower().strip())

# Count letter frequencies
for letter in text:
    if letter in letter_frequency:
        letter_frequency[letter] += 1
    else:
        letter_frequency[letter] = 1

# Calculate the entropy
length_sum = 0.0
for letter in letter_frequency:
    probability = float(letter_frequency[letter]) / len(text)
    length_sum += probability * log2(probability)

# Output
sys.stdout.write('Entropy: %f bits per character\n' % (-length_sum))
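
Assuming you save the script above under some name, say entropy.py (the filename is arbitrary), you would run it as python entropy.py < corpus.txt, where corpus.txt is your English corpus.

To make the first caution above more concrete, a conditional bigram estimate H(X_n | X_{n-1}), which is one step toward Shannon's n-gram approach but is not the isolated-character method of this answer, could be sketched like this (Python 3; the function name and the zip-based bigram counting are illustrative assumptions):

from collections import Counter
from math import log2

def conditional_bigram_entropy(text):
    # Estimate H(X_n | X_{n-1}) in bits per character from bigram counts.
    bigrams = Counter(zip(text, text[1:]))  # counts of adjacent character pairs
    firsts = Counter(text[:-1])             # counts of each pair's first character
    total = sum(bigrams.values())
    entropy = 0.0
    for (a, b), count in bigrams.items():
        p_ab = count / total               # joint probability p(a, b)
        p_b_given_a = count / firsts[a]    # conditional probability p(b | a)
        entropy -= p_ab * log2(p_b_given_a)
    return entropy

On English text this conditional estimate comes out noticeably lower than the isolated-character figure, because knowing the previous character reduces uncertainty about the next one.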

Source: https://habr.com/ru/post/910183/

