How to estimate the entropy of the English language

How can I estimate the entropy of the English language using the isolated probabilities of its characters?

1 answer

If we define “isolated character probabilities” the way it is done in this SO answer, we need to do the following:

  • Get as representative a sample of English text as possible (for example, a carefully selected mix of news articles, blog posts, scientific papers, and personal letters)

  • Iterate over its characters and count how many times each one occurs

  • Use each character's count divided by the total number of characters as an estimate of its probability

  • Calculate each character's average bit length by multiplying its probability by the negative logarithm of that same probability (a base-2 logarithm if we want the entropy in bits)

  • Sum these average lengths over all characters. That sum is the result (spelled out as a formula right after this list).
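
Put in symbols, the steps above compute the standard isolated-symbol entropy estimate (this is just a restatement of the list, not anything beyond it):

    H = - \sum_i p_i \log_2 p_i,   where   p_i = (count of character i) / (total number of characters)

and each term -p_i \log_2 p_i is the "average bit length" contributed by character i.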

Cautions:

  • This entropy of isolated characters is not what is usually called the Shannon entropy estimate for English. Shannon based his estimate on conditional n-gram probabilities rather than isolated characters, and much of his famous 1951 article is about how to determine a suitable n.

  • Most people who try to estimate the entropy of English exclude punctuation and normalize the whole text to lowercase.

  • The above assumes that the unit is a single character (letter) of English text. You could do the same thing for whole words or other units of text; a minimal word-level sketch follows this list.
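
As a concrete illustration of the last point, here is a minimal word-level sketch (Python 3; the function name word_entropy and the simple whitespace tokenization are my own illustrative choices, not part of the original answer):

from collections import Counter
from math import log2

def word_entropy(text):
    # Estimate entropy in bits per word from isolated word probabilities.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())

For example, word_entropy('the cat sat on the mat') returns roughly 2.25 bits per word.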

Code example:

Here is Python code that implements the procedure described above. It normalizes the text to lowercase and excludes punctuation and any other non-alphabetic, non-whitespace characters. It assumes that you have assembled a representative English corpus and supply it (ASCII-encoded) on STDIN.

import re
import sys
from math import log

# Function to compute the base-2 logarithm of a floating point number.
def log2(number):
    return log(number) / log(2)

# Function to normalise the text: replace any run of characters that is
# not a lowercase letter with a single space.
cleaner = re.compile('[^a-z]+')
def clean(text):
    return cleaner.sub(' ', text)

# Dictionary for letter counts
letter_frequency = {}

# Read and normalise the input text
text = clean(sys.stdin.read().lower().strip())

# Count letter frequencies
for letter in text:
    if letter in letter_frequency:
        letter_frequency[letter] += 1
    else:
        letter_frequency[letter] = 1

# Calculate the entropy
length_sum = 0.0
for letter in letter_frequency:
    probability = float(letter_frequency[letter]) / len(text)
    length_sum += probability * log2(probability)

# Output
sys.stdout.write('Entropy: %f bits per character\n' % (-length_sum))
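
Assuming you save the script above under some name, say entropy.py (the filename is arbitrary), you would run it as python entropy.py < corpus.txt, where corpus.txt is your English corpus.

To make the first caution above more concrete, a conditional bigram estimate H(X_n | X_{n-1}), which is one step toward Shannon's n-gram approach but is not the isolated-character method of this answer, could be sketched like this (Python 3; the function name and the zip-based bigram counting are illustrative assumptions):

from collections import Counter
from math import log2

def conditional_bigram_entropy(text):
    # Estimate H(X_n | X_{n-1}) in bits per character from bigram counts.
    bigrams = Counter(zip(text, text[1:]))  # counts of adjacent character pairs
    firsts = Counter(text[:-1])             # counts of each pair's first character
    total = sum(bigrams.values())
    entropy = 0.0
    for (a, b), count in bigrams.items():
        p_ab = count / total               # joint probability p(a, b)
        p_b_given_a = count / firsts[a]    # conditional probability p(b | a)
        entropy -= p_ab * log2(p_b_given_a)
    return entropy

On English text this conditional estimate comes out noticeably lower than the isolated-character figure, because knowing the previous character reduces uncertainty about the next one.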

Source: https://habr.com/ru/post/910183/

