The Kulback-Leibler divergence (KLD) is a measure of the "difference" between two probability distributions. Shortly put, the KLD between distributions f(.) and g(.) for a discrete distribution is

DKL(f,g) = Σipi log(pi)-pi log(qi)

The formula for a continuous distribution is obtained by replacing the summation sign by an integral. But I want to stay with discrete distributions here because I want to show you a few cool things. Let's first define a code over a set A as a transformation that goes from elements i ∈ A to numbers ki [0,∞) where

Σi exp(-ki) ≤ 1

A code is said to be complete if the condition above holds with equality.

Let us define now surprisals. Simply put, a surprisal value measures how unlikely you find a result to be. Mathematically expressed, this is

I(x;P) = log(1/P(X)) = -log(P(X))

where P is the probability distribution you mentally attributed to the events.

A code is said to match a probability distribution if it matches its surprisals, i.e. if

ki = I(xi;P)

What can we do with this framework? First we have the linking identity. Suppose we have two theories (probability distributions) P and Q and two codes K and C. Now suppose K and Q match. Let's denote for conciseness's sake the mathematical expectation (probability mean) of x under distribution y as <x,y>. Then we have the beautiful equation

<K,P> = H(P) + KLD(P,Q)

where <K,P> is the mean surprisal under theory P and H(P) = Σipi log(pi) is the well-known entropy of a distribution. In other words, theory P is known to have an inherent quantity called "entropy", and the mean surprisal of a Q-matching code K increases linearly as the KLD increases. Doesn't this give you the jitters?

Secondly, we have the partition inequality. Let Ai refer to pairwise disjoint partitions (that is, subsets that don't overlap, but whose union covers the whole of A). Then let us consider the following quantity:

PA(i) = Σx ∈ Ai P(x)

Then KLD(P,Q) ≥KLD(PA,QA). As a special case, consider the partition such that P1 = { x | p(x)>q(x) } and P2 = { x | p(x)≤q(x) }. Then we have, after a bit of tedious algebraic manipulation, that

KLD(P,Q) ≥ |p(x) - q(x)|2

– that is, KLD dominates over the "total variation" metric of real analysis, and thus is more than the mere analytical divergence between two abstract functions. It's important to note that KLD is not a true metric (and thus is not called "distance", but "divergence", since


KLD does satisfy the criteria for a premetric, and as such, a topology over probability distributions can be defined:

Pn → P if KLD(Pn,P) → 0

Extensions to the KLD (like KLM(P,Q) = (KLD(P,Q) + KLD(Q,P)/2)have been proposed so it becomes a true metric, but I find that the asymmetry in having a "true theory" in your heart and assessing a secondary theory from there is beautiful, and in a way reflects the limits of our impartiality. In a certain way, the KLD premetric defines a topology of the heart, and while that might sound quacky, it makes me feel peaceful. May your "KLD topology" of the people you meet and learn to love be a joyous one.