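In the fields of information theory and machine learning, entropy measures how varied the values in a data set are: it is 0 when every value is the same, and it grows as the values become more evenly spread over more possibilities. Conditional entropy describes the entropy of one characteristic of a data set, given knowledge of another characteristic.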

For example, suppose there is a data set with input X and output Y, where each data point is a student, X is the student's college major, and Y indicates whether or not the student enjoyed the movie "Gladiator".

       X          Y
      Math       Yes
      History    No
      CS         Yes
      Math       No
      Math       No
      CS         Yes
      History    No
      Math       Yes

The mathematical definition of entropy is given in the entropy node, but I will restate it here for convenience:

H(Z) = sum_{i=1}^{n} p_i * log2(1/p_i), where Z is a variable that takes n values, and p_i is the probability of the ith value of Z.

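You can see from the data set above that there are equal numbers of "yes" and "no" instances of the class Y, so the entropy of Y is 0.5 * log2(2) + 0.5 * log2(2) = 1. The input variable X is more varied (half the students are in Math, a quarter in History, and a quarter in CS), so its entropy is higher: H(X) = 0.5 * log2(2) + 0.25 * log2(4) + 0.25 * log2(4) = 1.5.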
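The two entropies above can be checked with a few lines of Python. This is only a minimal sketch: the function name entropy and the list names majors and liked are made up for illustration, and the probabilities are estimated as simple frequencies in the eight-row table.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # p * log2(1/p), with p = count / total, summed over the distinct values
    return sum((c / total) * log2(total / c) for c in counts.values())

# The eight students from the table (hypothetical variable names).
majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
liked = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

h_y = entropy(liked)
h_x = entropy(majors)
print(h_y)  # 1.0
print(h_x)  # 1.5
```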

The specific conditional entropy of Y is the entropy of Y computed only over the data points for which X has a specific value. For example, we could ask for the specific conditional entropy of Y over the records for which X = "Math". We would write this as H(Y | X = "Math"). The specific conditional entropies of Y for all values of X are:

H(Y | X = "Math") = 1
H(Y | X = "History") = 0
H(Y | X = "CS") = 0
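As a rough illustration, a specific conditional entropy can be computed by filtering the records down to one value of X and taking the ordinary entropy of what is left. The helper names below (entropy, specific_conditional_entropy) and the list names are assumptions made for this sketch, not anything from the lecture notes.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * log2(total / c) for c in counts.values())

def specific_conditional_entropy(xs, ys, x_value):
    """H(Y | X = x_value): entropy of Y over the records where X == x_value."""
    subset = [y for x, y in zip(xs, ys) if x == x_value]
    return entropy(subset)

# The eight students from the table (hypothetical variable names).
majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
liked = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

h_math = specific_conditional_entropy(majors, liked, "Math")
h_history = specific_conditional_entropy(majors, liked, "History")
h_cs = specific_conditional_entropy(majors, liked, "CS")
print(h_math, h_history, h_cs)  # 1.0 0.0 0.0
```

The Math students are split evenly between "Yes" and "No", which gives a full bit of entropy, while every History student said "No" and every CS student said "Yes", which leaves no uncertainty at all.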

The overall conditional entropy of Y is the average of the specific conditional entropies of Y over all values of X. Because the values of X are not equally likely, we can't simply take the unweighted mean of the specific conditional entropies listed above. Instead, we weight each specific conditional entropy by the probability that X takes the specified value:

H(Y | X) = sum_{i=1}^{n} Prob(X = v_i) * H(Y | X = v_i), where X takes n values, and v_i is the ith value of X.

In the case of the example,

 H(Y | X) = Prob(X = "Math") * H(Y | X = "Math") 
             + Prob(X = "History") * H(Y | X = "History") 
             + Prob(X = "CS") * H(Y | X = "CS")
          = (0.5 * 1) + (0.25 * 0) + (0.25 * 0)
          = 0.5
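The weighted sum above might be sketched in Python as follows. Again, the function and variable names are invented for the example, and Prob(X = v_i) is estimated as the fraction of records in which X takes that value.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * log2(total / c) for c in counts.values())

def conditional_entropy(xs, ys):
    """H(Y | X): specific conditional entropies weighted by Prob(X = v)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)  # collect the Y values seen with each X value
    total = len(xs)
    return sum((len(g) / total) * entropy(g) for g in groups.values())

# The eight students from the table (hypothetical variable names).
majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
liked = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

h_y_given_x = conditional_entropy(majors, liked)
print(h_y_given_x)  # 0.5
```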

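Conditional entropy is measured in bits when the logarithm is taken base 2, as it is here. It is used to compute information gain: the information gain about Y from knowing X is H(Y) - H(Y | X), which in this example is 1 - 0.5 = 0.5 bits.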

All examples taken from Carnegie Mellon University Machine Learning lecture notes by Andrew Moore.