The **Kullback-Leibler divergence** (KLD) is a measure of the "difference" between two probability distributions.
Briefly put, the KLD between two discrete distributions P and Q, with probabilities p_{i} and q_{i}, is

KLD(P,Q) = Σ_{i} p_{i} log(p_{i}) - p_{i} log(q_{i}) = Σ_{i} p_{i} log(p_{i}/q_{i})

The formula for a continuous distribution is obtained by replacing the summation sign by an integral. But I want to stay with discrete distributions here, because I want to show you a few cool things. Let's first define a *code* over a set A as a transformation that takes elements i ∈ A to numbers k_{i} ∈ [0,∞) where

Σ_{i} exp(-k_{i}) ≤ 1

A code is said to be *complete* if the condition above holds with equality.
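This condition is the Kraft inequality, here in nats since we are using exp and the natural log. A minimal Python sketch (the helper name `kraft_sum` and the example code are my own) of checking whether a code is valid and complete:

```python
import math

def kraft_sum(lengths):
    """Sum of exp(-k_i) over the code lengths; a valid code needs this <= 1."""
    return sum(math.exp(-k) for k in lengths)

# A code that assigns length log(4) (in nats) to each of 4 symbols.
code = [math.log(4)] * 4
total = kraft_sum(code)
print(total <= 1 + 1e-9)        # True: a valid code
print(abs(total - 1.0) < 1e-9)  # True: complete, the bound holds with equality
```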

Let us define now **surprisals**. Simply put, a surprisal value measures how unlikely you find a result to be. Mathematically expressed, this is

I(x;P) = log(1/P(x)) = -log(P(x))

where P is the probability distribution you mentally attributed to the events.

A code is said to *match* a probability distribution if it matches its surprisals, i.e. if

k_{i} = I(x_{i};P)
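As a small Python sketch (natural logs; the toy distribution is my own): a code built from the surprisals of a distribution P is automatically complete, since exp(-I(x_{i};P)) = p_{i} and the p_{i} sum to 1.

```python
import math

def surprisal(p_x):
    """I(x; P) = -log P(x), measured in nats."""
    return -math.log(p_x)

P = [0.5, 0.25, 0.25]             # a toy distribution over three outcomes
code = [surprisal(p) for p in P]  # the code that matches P: k_i = I(x_i; P)

# Matching codes are complete: sum_i exp(-k_i) = sum_i p_i = 1.
print(abs(sum(math.exp(-k) for k in code) - 1.0) < 1e-12)  # True
```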

What can we do with this framework? First, we have the **linking identity**. Suppose we have two theories (probability distributions) P and Q and a code K, and suppose K and Q match. Let's denote, for conciseness's sake, the mathematical expectation (probability mean) of x under distribution y as <x,y>. Then we have the beautiful equation

<K,P> = H(P) + KLD(P,Q)

where <K,P> is the mean surprisal under theory P and H(P) = -Σ_{i}p_{i} log(p_{i}) is the well-known **entropy** of the distribution. In other words, theory P has an inherent quantity called "entropy", and the mean surprisal of a Q-matching code K exceeds it by exactly the KLD between P and Q. Doesn't this give you the jitters?
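The identity is easy to verify numerically. A Python sketch (natural logs; the distributions P and Q are arbitrary picks of mine), building the Q-matching code K and comparing <K,P> against H(P) + KLD(P,Q):

```python
import math

def entropy(P):
    """H(P) = -sum_i p_i log(p_i), in nats."""
    return -sum(p * math.log(p) for p in P)

def kld(P, Q):
    """KLD(P, Q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]
K = [-math.log(q) for q in Q]                      # code matching Q: k_i = I(x_i; Q)
mean_surprisal = sum(p * k for p, k in zip(P, K))  # <K, P>

# The linking identity: <K,P> = H(P) + KLD(P,Q), up to floating point.
print(abs(mean_surprisal - (entropy(P) + kld(P, Q))) < 1e-12)  # True
```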

Secondly, we have the **partition inequality**. Let the A_{i} be the cells of a partition of A, that is, pairwise disjoint subsets whose union covers the whole of A. Then let us consider the coarsened distribution

P_{A}(i) = Σ_{x ∈ A_{i}} P(x)

with Q_{A} defined analogously. Then KLD(P,Q) ≥ KLD(P_{A},Q_{A}). As a special case, consider the two-cell partition A_{1} = { x | p(x)>q(x) } and A_{2} = { x | p(x)≤q(x) }. Then we have, after a bit of tedious algebraic manipulation, that

KLD(P,Q) ≥ 2·TV(P,Q)^{2}, where TV(P,Q) = (1/2) Σ_{x}|p(x) - q(x)|
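Both the partition inequality and this total-variation bound can be checked on a small example; a Python sketch with distributions of my own choosing:

```python
import math

def kld(P, Q):
    """KLD(P, Q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.3, 0.3]

# Coarsen both distributions over the two-cell partition
# A1 = {x : p(x) > q(x)}, A2 = {x : p(x) <= q(x)}.
A1 = [i for i in range(len(P)) if P[i] > Q[i]]
A2 = [i for i in range(len(P)) if P[i] <= Q[i]]
P_A = [sum(P[i] for i in A1), sum(P[i] for i in A2)]
Q_A = [sum(Q[i] for i in A1), sum(Q[i] for i in A2)]

tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))  # total variation distance
print(kld(P, Q) >= kld(P_A, Q_A))  # True: partition inequality
print(kld(P, Q) >= 2 * tv ** 2)    # True: the total-variation bound
```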

That is, KLD dominates the "total variation" metric of real analysis (this is Pinsker's inequality), and is thus more than the mere analytical divergence between two abstract functions. It's important to note that KLD is not a true metric (and thus is not called a "distance" but a "divergence"), since

KLD(P,Q) ≠ KLD(Q,P)
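A quick numerical illustration of the asymmetry, with an arbitrary pair of distributions of my own:

```python
import math

def kld(P, Q):
    """KLD(P, Q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

P = [0.9, 0.1]  # a confident theory
Q = [0.5, 0.5]  # a maximally uncertain one
print(round(kld(P, Q), 3))  # 0.368
print(round(kld(Q, P), 3))  # 0.511 -- the two directions disagree
```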

KLD does satisfy the criteria for a premetric, and as such, a topology over probability distributions can be defined:

P_{n} → P if KLD(P_{n},P) → 0
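For instance, a sequence of biased coins drifting toward a fair coin converges in this topology; a small Python sketch (the sequence P_{n} is my own choice):

```python
import math

def kld(P, Q):
    """KLD(P, Q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

P = [0.5, 0.5]  # the limiting fair coin
divs = []
for n in (1, 10, 100, 1000):
    Pn = [0.5 + 1 / (4 * n), 0.5 - 1 / (4 * n)]  # bias shrinking with n
    divs.append(kld(Pn, P))
print(divs)  # a decreasing sequence heading to 0: P_n -> P in the KLD topology
```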

Extensions of the KLD, like the symmetrized KLM(P,Q) = (KLD(P,Q) + KLD(Q,P))/2, have been proposed to remove the asymmetry (though even this symmetrized form fails the triangle inequality, so it is still not a true metric), but I find that the asymmetry of having a "true theory" in your heart and assessing a secondary theory from there is beautiful, and in a way reflects the limits of our impartiality. In a certain way, the KLD premetric defines a topology of the heart, and while that might sound quacky, it makes me feel peaceful. May your "KLD topology" of the people you meet and learn to love be a joyous one.