PMCC
A layman's guide
What Is This "Product Moment Correlation Coefficient"?
Despite its unwieldy name, the PMCC measures something that is easy to state informally: how strongly correlated two sets of data are. For example, we might expect the correlation between a person's height and their shoe size to be rather strong. On the other hand, the correlation between national wealth and average earlobe size is likely to be very weak. The PMCC provides us with a mathematically precise way to measure the strength of a correlation.
I will state the formula for the PMCC here for those who are already familiar with the concept and simply need it. For everyone else, this will all be explained later:
         Sxy
r = --------------,
    √(Sxx · Syy)

where Sxy = Σxy - n·x-bar·y-bar, and Sxx and Syy are defined in the same fashion: Sxx = Σx² - n·x-bar², Syy = Σy² - n·y-bar².
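If you simply want to compute it, here is a minimal Python sketch of that formula. The function name pmcc and the height/shoe-size figures are my own inventions, purely for illustration.

import math

def pmcc(xs, ys):
    """Product moment correlation coefficient via the shortcut formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    syy = sum(y * y for y in ys) - n * y_bar ** 2
    return sxy / math.sqrt(sxx * syy)

# Invented height (cm) and shoe-size data, deliberately made linear.
heights = [160, 165, 170, 175, 180, 185]
shoes = [5, 6, 7, 8, 9, 10]
print(pmcc(heights, shoes))   # 1.0 (or very nearly, given floating point)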
How Is This Formula Derived?
Let's first examine a few example graphs:
A B C
y^ y^ y^
| . |. | . .
| . | . | .
| . | . | . .
| . | . | .
--+--------->x --+--------->x --+----------->x
| | |
We see three different situations here. In A, there is a distinct positive correlation: as x increases, so too does y. In B, there is a strong negative correlation: as x increases, y decreases, and vice versa. In C, there is practically no correlation between the two sets of data x and y. We will be seeing these graphs again, so keep them in the back of your mind.
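If you would like to see numbers behind these pictures, Python's standard library (version 3.10 onwards) can compute the PMCC directly. The three datasets below are invented purely to mimic graphs A, B and C.

import statistics

x = [1, 2, 3, 4, 5, 6]
y_a = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]   # rises with x: positive correlation
y_b = [6.0, 5.2, 3.9, 3.1, 2.1, 0.9]   # falls as x rises: negative correlation
y_c = [3.0, 5.0, 1.0, 4.0, 2.0, 4.5]   # no pattern: little correlation

for name, y in [("A", y_a), ("B", y_b), ("C", y_c)]:
    print(name, round(statistics.correlation(x, y), 2))
# A prints a value close to +1, B close to -1, and C near 0.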
Now, let us take A as an example and enlarge it. We will mark its mean point with an X. The "mean point" is simply the 'average' point: using x with an overbar ("x-bar") to mean the average of x, and y-bar to mean the average of y, the mean point is at (x-bar, y-bar).
y^
| .
|
| .
| X
| . ^
| | Δy
| .<---'
| Δx
--+------------>x
|
For each point, we can measure its distance from the mean point. You can see I have already marked the distances for one point. For simplicity we use "map grid"-style distances, measuring the distance in one direction and then the other separately, rather than as a single straight line. We call the difference in the y-direction Δy ("delta y") and the difference in the x-direction Δx. These just mean "difference in y" and "difference in x", nothing more complex than that. Technically, we define them for any point (x, y) as follows:
Δx = x-bar - x
Δy = y-bar - y
For this particular point, both Δx and Δy are positive. For the top-right point, both Δx and Δy are negative. We can now notice something rather neat: for this graph, Δx times Δy is always positive, because a positive times a positive is positive, and a negative times a negative is also positive. Now refer back to graph B. You can hopefully see that on that graph, Δx times Δy is negative for every point: either Δx is positive and Δy negative, or vice versa. Finally, for C, Δx times Δy varies in sign from point to point; there is no pattern.
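A tiny worked example may help here. The points below are my own invention, arranged like graph A; the code simply prints Δx times Δy for each point.

# Invented points lying along an upward slope, like graph A.
points = [(1, 2), (2, 3), (4, 5), (5, 6)]
x_bar = sum(x for x, _ in points) / len(points)   # 3.0
y_bar = sum(y for _, y in points) / len(points)   # 4.0

for x, y in points:
    dx = x_bar - x   # Δx
    dy = y_bar - y   # Δy
    print(f"point ({x}, {y}): Δx·Δy = {dx * dy}")
# Every product is positive: below-left points give positive × positive,
# above-right points give negative × negative.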
This is very useful! We now have a mathematical way to distinguish graphs with positive correlation, negative correlation and no correlation. If we add up Δx times Δy for each point in the three graphs, we find another pattern: A gives us a large positive number, B gives us a large negative number, and C gives us a number near 0, as the positive and negative products mostly cancel out.
This provides us with a start for measuring correlation. Σ is mathematical notation for "add up all of", so our initial measure might be:
Σ(x-bar - x)(y-bar - y).
This gives a positive number for positive correlation, a negative number for negative correlation, and a number near 0 for no correlation.
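As a sketch, this measure is only a few lines of Python. The helper raw_measure and the two invented datasets are my own, for illustration.

def raw_measure(points):
    """Sum of (x-bar - x)(y-bar - y) over all points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    return sum((x_bar - x) * (y_bar - y) for x, y in points)

a = [(1, 2), (2, 3), (4, 5), (5, 6)]   # upward slope, like graph A
b = [(1, 6), (2, 5), (4, 3), (5, 2)]   # downward slope, like graph B
print(raw_measure(a), raw_measure(b))  # 10.0 -10.0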
However, this still has some problems. Say we had two sets of data and wished to compare their correlations. There is no guarantee that one set will have as many points as the other, and with our current measure, the larger set will inevitably produce a larger number simply because there are more points to add up! We can get round this by dividing by the number of points we have. Then we can compare sets of different sizes fairly. Our revised formula would look something like:
Σ(x-bar - x)(y-bar - y)
-------------------------
n
where n is the number of points.
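To see why the division matters, try the same upward-sloping pattern with every point repeated, doubling n. Again the data and the raw_measure helper are invented for illustration.

def raw_measure(points):
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    return sum((x_bar - x) * (y_bar - y) for x, y in points)

small = [(1, 2), (2, 3), (4, 5), (5, 6)]
large = small * 2   # exactly the same pattern, twice the points

print(raw_measure(small), raw_measure(large))   # 10.0 20.0: unfair to compare
print(raw_measure(small) / len(small),
      raw_measure(large) / len(large))          # 2.5 2.5: now fair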
This is better, but we still have a problem. Look at the following two sets of data:
D E
y^ y^
| |
| | .
| |
| | .
| . |
| . | .
| . |
| . | .
| |
--+------------------>x --+------------------->x
It's fair to say that both these sets have equally strong correlation. However, our current measure will give E a much larger figure simply because the physical distances between the points are much greater. This is again a problem. We can resolve this by using something called the standard deviation. This is hopefully explained under that node, but in rough terms, the standard deviation measures how widely spread data is. If σx is large, x is widely spread, as in E; if it is small, the data is closely packed, as in D. Therefore, if we divide our current formula by σx and σy, we can correct for the unfairness presented by data with different spreads. Our once-again revised formula is then:
Σ(x-bar - x)(y-bar - y)
-------------------------
nσxσy
This, at last, is what we call the Product Moment Correlation Coefficient, which has the symbol r.
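Here is that definition as a Python sketch, using the population standard deviation from the standard library (the one that divides by n, matching our formula). The function name and data are my own, for illustration.

import statistics

def pmcc_from_definition(xs, ys):
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    total = sum((x_bar - x) * (y_bar - y) for x, y in zip(xs, ys))
    # pstdev is the population standard deviation, σ.
    return total / (n * statistics.pstdev(xs) * statistics.pstdev(ys))

xs = [1, 2, 4, 5]
ys = [2, 3, 5, 6]   # ys = xs + 1: a perfect straight line
print(pmcc_from_definition(xs, ys))   # 1.0: perfect positive correlation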
However, calculating it this way is a lot of work. Therefore, we can do some mathematical manipulation and turn it into the following formula:

          Σxy - n·x-bar·y-bar
r = -------------------------------------------,
    √[ (Σx² - n·x-bar²)(Σy² - n·y-bar²) ]
Despite its imposing nature, this formula gives exactly the same results as the one we derived. It is just much quicker to calculate.
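If you are sceptical, it is easy to check the two formulas against each other on random data. Both function names below are my own; each implements one of the two formulas above.

import math
import random
import statistics

def pmcc_definition(xs, ys):
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    total = sum((x_bar - x) * (y_bar - y) for x, y in zip(xs, ys))
    return total / (n * statistics.pstdev(xs) * statistics.pstdev(ys))

def pmcc_shortcut(xs, ys):
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    syy = sum(y * y for y in ys) - n * y_bar ** 2
    return sxy / math.sqrt(sxx * syy)

xs = [random.uniform(0, 10) for _ in range(20)]
ys = [random.uniform(0, 10) for _ in range(20)]
print(math.isclose(pmcc_definition(xs, ys), pmcc_shortcut(xs, ys)))   # True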
Tell Me More About The PMCC!
The PMCC has a number of interesting properties. For one, it always falls between -1 and 1, whatever data it is presented with. Very strong correlation occurs when r gets close to either end of the scale; weak correlation occurs around 0, as promised. Proving that -1 ≤ r ≤ 1 is not so simple!
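For the curious, the standard route to that proof is the Cauchy-Schwarz inequality, which in our notation says that [Σ(x-bar - x)(y-bar - y)]² ≤ [Σ(x-bar - x)²] · [Σ(y-bar - y)²]. Dividing through and taking square roots gives |r| ≤ 1, with equality exactly when the points lie on a straight line.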
Despite our optimisation at the end there, the PMCC is still a lot of work to calculate if you have a lot of points. There is a related measure of correlation called Spearman's Rank Correlation Coefficient that is much quicker to calculate but in some sense less accurate. You can read more about that there.
One for "Maths For The Masses", I think.