The Entropy of a Language

Entropy is basically a measure of randomness. This is all well and good, but it also turns out to be a measure of several other things, which really all come down to the same thing. Rather than repeat the obscure and uninformative definition I was treated to in no fewer than three courses this year, here is an exploration of the way I understand the concept of Shannon entropy.

Let's reduce our alphabet to just 28 characters: letters a to z, the space and the full stop. With this, we can write words and arrange them into sentences. Take a string of these characters of length n. How many possible strings exist? 28^n (because for each character we have 28 choices). We then ask ourselves how many of these strings are legal in a given language. The higher the proportion of legal strings, the higher the entropy.

Note that here we define a language as a set of rules which decide whether a string is legal or not. This includes natural languages such as English, binary code, random languages and the rule that any string is permitted so long as its vowels and consonants alternate. You should also note that the space and full stop characters are being treated like any other, so among the possible strings are "..  ..   " and "   d     ".
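To make the counting argument concrete, here is a small sketch of my own (the "alternating vowels and consonants" rule is filled in arbitrarily: the space and the full stop count as consonants). It enumerates every string of length 3 and checks what fraction of the 28^3 possibilities is legal:

    from itertools import product
    from math import log2

    ALPHABET = "abcdefghijklmnopqrstuvwxyz ."
    VOWELS = set("aeiou")

    def legal(s):
        # Legal iff no two neighbouring characters are both vowels
        # or both non-vowels.
        return all((a in VOWELS) != (b in VOWELS) for a, b in zip(s, s[1:]))

    n = 3
    total = len(ALPHABET) ** n                        # 28^n possible strings
    count = sum(legal("".join(t)) for t in product(ALPHABET, repeat=n))
    print(count, "legal strings out of", total)       # 3220 out of 21952
    print("bits per character:", log2(count) / n)     # ~3.9, below the 4.8 maximum

The more rules we add, the smaller the legal fraction becomes and the fewer bits each character is worth; that bits-per-character figure is exactly the entropy discussed below.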

Entropy is a measure of surprise

Suppose we choose a language in which the only legal strings are those made up entirely of the letter l. You are reading a book written in this language. You have just finished the first page (which was covered in ls). Quick! Before turning to the next page, what is the following character going to be?

Were you surprised to discover the l on the following page? No? That is because such a language has zero entropy. Now consider a language where all strings are allowed. Can you predict the next character in this string? uqoaxul jtb.yjhdcn. Because it could be any of the 28, you have only a slim chance of being correct. This is the language whose entropy is maximal. The concept carries over to English: like most natural languages, English has a relatively low entropy, which is why you are easily able to guess the last 6 characters of this sen... Because many natural languages have a fixed set of legal words, it is often more interesting to look at word entropy; but for simplicity, we'll stick to character entropy in this article.
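As a quick sanity check of the guessing game (a sketch of my own, not the article's; the guessing strategies are arbitrary stand-ins), simulating the two languages shows a perfect hit rate for the all-l language and roughly a 1-in-28 hit rate for the fully random one:

    import random

    ALPHABET = "abcdefghijklmnopqrstuvwxyz ."

    def hit_rate(next_char, guess, trials=100_000):
        # Fraction of the time the guess matches the next character.
        return sum(guess() == next_char() for _ in range(trials)) / trials

    # The all-l language: the next character is always 'l'.
    print(hit_rate(lambda: "l", lambda: "l"))                       # 1.0
    # The fully random language: no strategy beats 1 in 28.
    print(hit_rate(lambda: random.choice(ALPHABET), lambda: "e"))   # ~0.036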

Entropy is a measure of information

Now, let's see how much information is carried by the two languages discussed in the previous paragraph. Measuring information is rather tricky. Clearly, in the language composed only of ls, each character conveys no information, because if you know what the language is, you already know what character comes next. We'll give it an entropy of 0: no information per character. Remember twenty questions? The fact that we can pin down a concept using nothing but yes-no questions suggests that a single yes-no answer can serve as one piece of information. As such, a random binary code has an entropy of 1 bit per character. This bit will be our unit of entropy.
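Here is a minimal twenty-questions sketch (my own illustration) of why one yes-no answer makes a natural unit: twenty answers are enough to single out one item among 2^20, because each answer halves the remaining possibilities and so is worth exactly one bit:

    def twenty_questions(secret, low=0, high=2**20):
        # Locate `secret` in [low, high) using only yes/no questions
        # of the form "is it less than m?".
        questions = 0
        while high - low > 1:
            mid = (low + high) // 2
            questions += 1
            if secret < mid:        # answer "yes"
                high = mid
            else:                   # answer "no"
                low = mid
        return low, questions

    print(twenty_questions(314_159))    # (314159, 20): 20 answers, 20 bits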

You might think that, in a 28-character alphabet, each character conveys 28 pieces of information: namely the presence of one character and the absence of the 27 others. This is not quite correct: because the presence of one character automatically forces the others' absence, we don't get that much information. It is as if I told you that a cloud was white and then went on to say it wasn't black: no additional information. Binary code again shows us how to measure the entropy of our random language: because it takes log2(n) bits to encode an n-character alphabet, the random language has an entropy of log2(28) ≈ 4.8 bits per character. That means a 28-character alphabet can encode a maximum of 4.8 bits (or pieces) of information with each character.
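One way to see where the 4.8 comes from (again a sketch of my own, not a standard encoding): packing a string into a single base-28 integer and counting the bits of that integer gives almost exactly log2(28) bits per character:

    from math import log2

    ALPHABET = "abcdefghijklmnopqrstuvwxyz ."

    def pack(s):
        # Treat the string as one big number written in base 28.
        value = 0
        for ch in s:
            value = value * 28 + ALPHABET.index(ch)
        return value

    message = "hello world. " * 50                       # 650 characters
    print(pack(message).bit_length() / len(message))     # ~4.8
    print(log2(28))                                      # 4.807...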

What is entropy for?

In natural languages, entropy is quite hard to measure. Most of them come out somewhere around 1.5 bits per character. It might seem that this is a bit of a waste (entropy at higher levels, such as the word and sentence level, is also very low), but this high level of redundancy is what allows us to read only the shape of words, and what allows us to understand each other even in noisy environments. Computers that recognize human speech try to exploit this low entropy to make sure they didn't misinterpret a sound; but this doesn't always work, because the redundancy only exists thanks to the very large number of rules that come with the language, and it is difficult to tell a computer about all of them.
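A crude first attempt at measuring it (my own sketch, not the article's method) is the zero-order estimate -Σ p·log2(p) over single-character frequencies; even on a small sample it lands around 4 bits per character, below the 4.8 maximum but still far above the true ~1.5, because it ignores every rule that ties neighbouring characters and words together:

    from collections import Counter
    from math import log2

    def zero_order_entropy(text):
        # -sum(p * log2(p)) over single-character frequencies;
        # ignores all context, so it overestimates the real entropy.
        counts = Counter(text)
        total = len(text)
        return -sum(c / total * log2(c / total) for c in counts.values())

    sample = ("the higher the proportion of legal strings the higher "
              "the entropy. entropy is a measure of surprise.")
    print(zero_order_entropy(sample))   # roughly 4 bits per character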

In fact, about the only place where low entropy is a problem is in computer science. Information saved or sent as plain text is not random: it follows the rules of English, Java or XML and so contains a great deal of redundant information. This is what compression programs exploit: they take a file written in a low-entropy language and re-encode it in a language (set of rules) with a higher entropy per character, thus saving space.
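You can watch this happen with any off-the-shelf compressor; here is a quick sketch using Python's zlib (my choice of tool, the article doesn't name one). Redundant English-like text shrinks dramatically, while random bytes, which have no redundancy left to remove, barely shrink at all:

    import os
    import zlib

    english_like = ("the quick brown fox jumps over the lazy dog. " * 200).encode()
    random_bytes = os.urandom(len(english_like))

    for name, data in [("english-like", english_like), ("random", random_bytes)]:
        compressed = zlib.compress(data, 9)        # maximum compression level
        print(name, len(data), "->", len(compressed))
    # the repetitive text compresses to a tiny fraction of its size;
    # the random data comes out roughly as long as it went in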

Entropy also comes in handy when defining the concept of unicity distance in cryptography. Roughly, it tells us that if we know which cipher is used, try each possible key in turn to decode the secret message, and know that the decoded message should be in English, then because of English's low entropy it is highly unlikely that more than one decoding will actually follow the rules of English. This problem is usually countered by having a very large number of keys, which both increases the effort needed to try each of them in turn and increases the probability of having more than one meaningful output.
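The usual back-of-the-envelope estimate (standard textbook material, though not derived in this article) puts the unicity distance at roughly H(K)/D: the entropy of the key divided by the per-character redundancy of the plaintext language. For a simple substitution cipher on English:

    from math import log2, factorial

    key_entropy = log2(factorial(26))      # ~88 bits: any permutation of a-z
    redundancy = log2(26) - 1.5            # ~3.2 bits/char that English wastes
    print(key_entropy / redundancy)        # ~28 characters of ciphertext

Beyond roughly that many ciphertext characters, only one key is expected to produce something that still looks like English; a cipher with far more keys pushes that threshold out accordingly.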

Does this entropy have anything to do with the entropy I learnt in physics?

Yes and no. Language entropy is obviously not a physical concept and has little to do with thermodynamics. Physical entropy, however, can also be seen as a measure of randomness or information. The classic example I learnt was that of a basin full of red and blue balls: the second law of thermodynamics tells us that the basin will eventually become a random mix of red and blue balls. We can liken this random state to the random language, where each particle carries maximal information (its position). And we can liken the state where the red balls are on one side and the blue on the other to the l language, where each ball tells us nothing, because we already know where it is. As noded elsewhere, physical entropy doesn't really operate at the level of balls but at the level of particles.

This concludes our little foray into the world of entropy. Just remember that, as in all things, there are formal definitions and layman's explanations. When you do learn about language entropy in a CS course, you can use this only as a conceptual aid: entropy is a measure of randomness, of information, of surprise and, inversely, of the number of rules.

This node brought to you by three CS courses which left me thoroughly confused, and by one kind professor who had me help him write next year's course, thanks to whom I finally understood entropy by having to write an introduction to it myself. (That introduction was specifically about cryptography and thus bears little resemblance to this article.)