If we define “isolated character probabilities” the way it is done in this SO answer, we will need to do the following:
Get a representative sample of English text (for example, a carefully selected collection of news articles, blog posts, scientific articles, and personal letters), as large as possible
Iterate over the characters of that sample and count how often each of them occurs
Use the frequency divided by the total number of characters as an estimate of the probability of each character
Calculate the average bit length contributed by each character by multiplying its probability by the negative logarithm of that probability (a base-2 logarithm if we want the unit of entropy to be bits)
Take the sum of these average lengths over all characters. This sum is the result (the formula is written out just after this list).
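In symbols, the last three steps compute the usual entropy sum in bits per character, where p_c denotes the estimated probability of character c:

```latex
H = \sum_{c} p_c \,\bigl(-\log_2 p_c\bigr) = -\sum_{c} p_c \log_2 p_c
```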
Cautions:
This entropy of isolated characters is not what is commonly called the Shannon entropy estimate for English. Shannon based his estimate on conditional n-gram probabilities rather than on isolated symbols, and his famous 1950 paper is largely about how to determine the optimal n.
Most people who try to evaluate the entropy of the English language exclude punctuation and normalize the entire text to lowercase.
The above assumes that a symbol is defined as a single character (or letter) of English text. You could do a similar thing for whole words or other units of text, as sketched below.
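For instance, a word-level variant of the same estimate could treat whitespace-separated words as the symbols. Here is a minimal sketch, assuming `collections.Counter` and naive whitespace tokenization (both are illustrative choices, not something prescribed above):

```python
from collections import Counter
from math import log

def entropy_bits(tokens):
    """Entropy in bits of the empirical distribution over the given tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return sum((n / total) * -log(n / total, 2) for n in counts.values())

sample = "the cat sat on the mat and the dog sat on the rug"
print(entropy_bits(sample.split()))   # word-level estimate
print(entropy_bits(list(sample)))     # character-level estimate, for comparison
```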
Code example:
Here is the Python code that implements the procedure described above. It normalizes the text to lowercase and excludes punctuation and any other non-alphabetic, non-whitespace characters. It assumes that you have assembled a representative English corpus and supplied it (ASCII-encoded) on STDIN.
```python
import re
import sys
from math import log
```
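A minimal, self-contained sketch of such a program, assuming the corpus is piped in on STDIN as ASCII as described above (the variable names and the exact regular expression are illustrative assumptions, not the original listing):

```python
import re
import sys
from math import log

# Read the corpus from STDIN and normalize it: lowercase everything and
# keep only ASCII letters and whitespace (punctuation, digits, etc. are dropped).
text = sys.stdin.read().lower()
text = re.sub(r"[^a-z\s]", "", text)

# Count how often each remaining character occurs.
counts = {}
for ch in text:
    counts[ch] = counts.get(ch, 0) + 1
total = sum(counts.values())

# Sum p * (-log2 p) over all characters: the entropy in bits per character.
entropy = 0.0
for n in counts.values():
    p = n / total
    entropy += p * -log(p, 2)

print(entropy)
```

Saved as, say, `entropy.py`, it could be run as `python3 entropy.py < corpus.txt`; for typical English text the result comes out at roughly 4 bits per character.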