conditional entropy (thing) by Maayan

In the fields of information theory and machine learning, entropy describes how uniform or varied a data set is. Conditional entropy describes the entropy of one characteristic of a data set, given knowledge of another characteristic.

For example, suppose there is a data set with input X and output Y, where each data point is a student, X is the student's college major, and Y indicates whether or not the student enjoyed the movie "Gladiator".

       X          Y
      Math       Yes
      History    No
      CS         Yes
      Math       No
      Math       No
      CS         Yes
      History    No
      Math       Yes

The mathematical definition of entropy is given in the entropy node, but I will restate it here for convenience:

H(Z) = sum_{i = 1,...,n} (p_i * log₂(1/p_i)) where Z is a variable that takes n values, and p_i is the probability of the i^th value of Z.

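You can see from the data set above that there are equal numbers of "Yes" and "No" instances of the class Y, so the entropy of Y is 1. The input variable X is more varied: it takes the value "Math" with probability 0.5, and "History" and "CS" with probability 0.25 each. Its entropy is therefore higher: H(X) = (0.5 * 1) + (0.25 * 2) + (0.25 * 2) = 1.5.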
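To check these two numbers, here is a minimal Python sketch of the entropy formula applied to the table above. The function name entropy and the list names majors and liked are my own choices for illustration, not something from the lecture notes, and the probabilities are simply estimated as relative frequencies in the data.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # p_i * log2(1 / p_i), with p_i = count / total
    return sum((c / total) * log2(total / c) for c in counts.values())

# The eight students from the table: X is the major, Y is the "Gladiator" answer.
majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
liked = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

h_x = entropy(majors)
h_y = entropy(liked)
print(h_x, h_y)  # 1.5 1.0
```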

The specific conditional entropy of Y is the entropy of Y computed only over the data points for which X has a specific value. For example, we could ask for the specific conditional entropy of Y over the records for which X = "Math". We would write this as H(Y | X = "Math"). The specific conditional entropies of Y for all values of X are:

H(Y | X = "Math") = 1
H(Y | X = "History") = 0
H(Y | X = "CS") = 0
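The three values above can be reproduced by filtering the records before computing the entropy. This is only a sketch under the same assumptions as the definition of entropy given earlier; the helper name specific_conditional_entropy is hypothetical, chosen for this example.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * log2(total / c) for c in counts.values())

def specific_conditional_entropy(xs, ys, x_value):
    """H(Y | X = x_value): entropy of Y over the records where X == x_value."""
    subset = [y for x, y in zip(xs, ys) if x == x_value]
    return entropy(subset)

majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
liked = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

h_math = specific_conditional_entropy(majors, liked, "Math")
h_history = specific_conditional_entropy(majors, liked, "History")
h_cs = specific_conditional_entropy(majors, liked, "CS")
print(h_math, h_history, h_cs)  # 1.0 0.0 0.0
```

The Math students are split evenly between "Yes" and "No", so knowing that a student studies Math tells us nothing about Y. All History students answered "No" and all CS students answered "Yes", so for those majors there is no uncertainty left.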

The overall conditional entropy of Y is the average conditional entropy of Y over all values of X. Because the values of X are not equally likely, we can't simply take the unweighted mean of the specific conditional entropies listed above. Instead, we weight each specific conditional entropy by the probability that X takes the specified value:

H(Y | X) = sum_{i = 1,...,n} (Prob(X = v_i) * H(Y | X = v_i)) where X takes n values, and v_i is the i^th value of X.

In the case of the example,

 H(Y | X) = Prob(X = "Math") * H(Y | X = "Math") 
             + Prob(X = "History") * H(Y | X = "History") 
             + Prob(X = "CS") * H(Y | X = "CS")
          = (0.5 * 1) + (0.25 * 0) + (0.25 * 0)
          = 0.5
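The same weighted sum can be written as a short Python sketch. The function name conditional_entropy is an assumption made for this example, and Prob(X = v_i) is estimated as the fraction of records with that value of X.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * log2(total / c) for c in counts.values())

def conditional_entropy(xs, ys):
    """H(Y | X): specific conditional entropies weighted by Prob(X = v)."""
    total = len(xs)
    result = 0.0
    for x_value, count in Counter(xs).items():
        subset = [y for x, y in zip(xs, ys) if x == x_value]
        result += (count / total) * entropy(subset)
    return result

majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
liked = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

h_y_given_x = conditional_entropy(majors, liked)
print(h_y_given_x)  # 0.5
```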

Conditional entropy is measured in units of bits when the logarithm is taken base 2, as it is here, and is used to compute information gain.
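This writeup does not define information gain, so the following is an assumption on my part: under the usual definition, the information gain of Y from X is H(Y) - H(Y | X), the number of bits of uncertainty about Y removed by learning X. A sketch for the example data, with the hypothetical function name information_gain:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * log2(total / c) for c in counts.values())

def conditional_entropy(xs, ys):
    """H(Y | X): specific conditional entropies weighted by Prob(X = v)."""
    total = len(xs)
    return sum(
        (count / total) * entropy([y for x, y in zip(xs, ys) if x == x_value])
        for x_value, count in Counter(xs).items()
    )

def information_gain(xs, ys):
    """Assumed definition: IG(Y | X) = H(Y) - H(Y | X)."""
    return entropy(ys) - conditional_entropy(xs, ys)

majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
liked = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

gain = information_gain(majors, liked)
print(gain)  # 0.5
```

In other words, knowing a student's major removes half a bit of the one bit of uncertainty about whether they enjoyed the movie.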

All examples taken from Carnegie Mellon University Machine Learning lecture notes by Andrew Moore.
