In the fields of information theory and machine learning, entropy describes how uniform or varied a data set is. Conditional entropy describes the entropy of one characteristic of a data set, given knowledge of another characteristic.

For example, suppose there is a data set with input `X` and output `Y`, where each data point is a student, `X` is the student's college major, and `Y` indicates whether or not the student enjoyed the movie "Gladiator".

| `X`     | `Y` |
| ------- | --- |
| Math    | Yes |
| History | No  |
| CS      | Yes |
| Math    | No  |
| Math    | No  |
| CS      | Yes |
| History | No  |
| Math    | Yes |

The mathematical definition of entropy is given in the entropy node, but I will restate it here for convenience:

`H`(`Z`) = sum_{i = 1,...,n}(`p`_{i} * log_{2}(1/`p`_{i})), where `Z` is a variable that takes `n` values, and `p`_{i} is the probability of the `i`^{th} value of `Z`.

You can see from the data set above that there are equal numbers of "yes" and "no" instances of the class `Y`, so the entropy of `Y` is 1. The input variable `X` is more varied, so its entropy is higher: `X` takes the value "Math" with probability 0.5 and each of "History" and "CS" with probability 0.25, so `H`(`X`) = (0.5 * 1) + (0.25 * 2) + (0.25 * 2) = 1.5.
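Here is a minimal Python sketch of these calculations (the `entropy` helper and the list-of-pairs encoding of the table are illustrative, not from the lecture notes):

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(Z) = sum over the values of Z of p_i * log2(1/p_i)."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * log2(total / c) for c in counts.values())

# The example data set as (major, enjoyed "Gladiator") pairs.
data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

print(entropy([y for _, y in data]))  # H(Y) = 1.0
print(entropy([x for x, _ in data]))  # H(X) = 1.5
```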

The specific conditional entropy of `Y` is the entropy of `Y` computed only over the data points for which `X` has a specific value. For example, we could ask for the specific conditional entropy of `Y` over the records for which `X` = "Math". We would write this as `H`(`Y` | `X` = "Math"). The specific conditional entropies of `Y` for all values of `X` are:

`H`(`Y` | `X` = "Math") = 1

`H`(`Y` | `X` = "History") = 0

`H`(`Y` | `X` = "CS") = 0
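Continuing the sketch above, a specific conditional entropy is just `entropy` applied to the rows where `X` takes the given value (the helper name is again illustrative):

```python
def specific_conditional_entropy(data, x_value):
    # H(Y | X = x_value): entropy of Y over the matching rows only.
    return entropy([y for x, y in data if x == x_value])

print(specific_conditional_entropy(data, "Math"))     # 1.0
print(specific_conditional_entropy(data, "History"))  # 0.0
print(specific_conditional_entropy(data, "CS"))       # 0.0
```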

The overall conditional entropy of `Y` is the average of the specific conditional entropies over all values of `X`. Because the values of `X` are not all equally likely, we can't simply take the unweighted mean of the specific conditional entropies listed above. Instead, we weight each specific conditional entropy by the probability that `X` takes the corresponding value:

`H`(`Y` | `X`) = sum_{i = 1,...,n} (Prob(`X` = `v`_{i}) * `H`(`Y` | `X` = `v`_{i})) where `X` takes `n` values, and `v`_{i} is the `i`^{th} value of `X`.

In the case of the example,

`H`(`Y` | `X`) = Prob(`X` = "Math") * `H`(`Y` | `X` = "Math")
+ Prob(`X` = "History") * `H`(`Y` | `X` = "History")
+ Prob(`X` = "CS") * `H`(`Y` | `X` = "CS")
= (0.5 * 1) + (0.25 * 0) + (0.25 * 0)
= 0.5
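Completing the sketch, the overall conditional entropy weights each specific conditional entropy by the frequency of the corresponding `X` value:

```python
def conditional_entropy(data):
    # H(Y | X) = sum over values v of X of Prob(X = v) * H(Y | X = v).
    xs = [x for x, _ in data]
    return sum((xs.count(v) / len(xs)) * specific_conditional_entropy(data, v)
               for v in set(xs))

print(conditional_entropy(data))  # 0.5
```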

Conditional entropy, like entropy, is measured in units of bits, and it is used to compute information gain: the information gain of `X` is `H`(`Y`) - `H`(`Y` | `X`), which for this example is 1 - 0.5 = 0.5 bits.

All examples taken from Carnegie Mellon University Machine Learning lecture notes by Andrew Moore.