Suppose you have a data set comprised of entries where each entry is an input, X and an output, Y. Information gain is the quantity that answers the question, "How 'useful' is X when predicting Y?"
"Useful" is a relatively fuzzy word. To better understand information gain, it may be helpful to reformulate the question as, "When transmitting a binary representation of Y, how many bits can I save if both I and the person receiving the transmission know the value of X?"
Consider the following example. (Note that I am using the same example that I used in my description of conditional entropy. This is because conditional entropy is an important concept used to determine information gain, so it may be helpful to examine both writeups together.) In this example, every data point in the set is a student, with X being a variable that indicates the student's college major, and Y being a variable that indicates whether or not the student enjoyed the movie "Gladiator".
If I wanted to transmit the value of Y, I could do it with one bit, either a 1 if the student liked "Gladiator", or a 0 if he did not. But a closer examination of the data reveals something interesting - by knowing the value of X for a given student, I gain a great deal of information about the corresponding value of Y. For example, if I know that a student is a History major, I can conclude that he did not like "Gladiator", whereas with a Computer Science student, I can be equally certain that he did like the movie.
To calculate the numeric value of information gain, one must calculate the total entropy of the class being transmitted, and the conditional entropy of the class given the input characteristic. Since I worked out the computations for the above example in the conditional entropy node, I will not repeat them here. The results are:
H(Y) = 1 (It's important to note that this is the same as the number of bits needed to transmit Y given no additional information.)
H(Y | X) = 0.5
The formula for information gain is:
IG(Y | X) = H(Y) - H(Y | X)
For this example, IG
) = 0.5. That means that given a knowledge of X
, I can transmit Y
with, on average
, half a bit per data point
Information gain is most commonly used in the fields of information theory and machine learning. For example, when building a decision tree, information gain is used to choose the most valuable input characteristic for "splitting".
It is important to note that information gain can never be negative. After all, how can an additional piece of information make you less likely to be able to predict the output?
All examples taken from Carnegie Mellon University Machine Learning lecture notes by Andrew Moore.