Standardizing data is a statistical method for comparing two normal distributions that have different arithmetic means and/or standard deviations. It can also be used to find the probability of a data point falling within a range of values.


What the hell?

What that means is that, if you have a normal distribution of data, you can find out the percent of data that is less than (also called "below" or "to the left of") or greater than ("above", "to the right of") any given value.

A normal distribution is one wherein the graph follows a normal curve (a bell curve). A normal curve is defined by the equation:

       1               x - μ
y = -------- e^( -.5 (-------)^2)
    σ √(2π)             σ

Or, for the <pre> impaired, y = e^(-.5((x-μ)/σ)^2) / (σ √(2π))

Where μ is the mean and σ is the standard deviation. In order to work with probability, the area under the curve is always 1, regardless of μ and σ. If you were to shade in a region under the curve and find its area, you would have the probability of a random data point being in that region (say, "60% of all scores lie between 84 and 95"). Another way of thinking about this is to find in which percentile x lies. Because the curve is completely determined by μ and σ, a normal curve can be specified with the notation N(μ, σ). Because this curve is also symmetrical about μ, the mean is also the median, so one can speak of 50% of the scores being below μ and 50% being above μ.
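
If you want to check that "area is always 1" claim for yourself, here is a quick Python sketch (any recent Python will do): it codes up the density formula above and integrates it numerically with the trapezoid rule.

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # y = e^(-.5 ((x - mu) / sigma)^2) / (sigma * sqrt(2 * pi))
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Trapezoid rule over mu +/- 8 sigma, which captures essentially all the area.
mu, sigma, n = 100.0, 10.0, 100_000
a, b = mu - 8 * sigma, mu + 8 * sigma
h = (b - a) / n
area = (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma)) / 2
area += sum(normal_pdf(a + i * h, mu, sigma) for i in range(1, n))
area *= h
print(area)  # ~1.0, no matter what mu and sigma you pick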

If you don't understand what this looks like, here is a horrible ASCII representation:

                 |
              .'¯|¯'.
             /   |   \
            /    |----\
          ..     | σ   ..
 ___...'''       |       '''...__
-----------------|----------------
                 μ

As a side note, μ ± σ are points of inflection of the curve.


Standardizing Normal Distributions

I'm not sure how to explain standardizing without giving an application, so I won't. Say you have a class of an infinite number of students (okay, maybe not a very likely application; pretend the actual number is just very large) and you have given them a test. The scores approximate a normal distribution with a mean score of 100 points and a standard deviation of 10 points; this can also be written N(100, 10). 50% of students scored 100 points or less. What if you want to find out what percent of students scored below 110? What you need to do is find the area under the curve to the left of 110:

                _|_
              .'/|/'.
             ////|///\
            /////|////\
          ../////|////|..
 ___...'''///////|////|  '''...__
/////////////////|////|
-----------------|----|----------
                100  110
                (μ)

There are three ways of doing this. The most common way is probably to use a graphing calculator such as a TI-83 Plus. There is also a table method, which is what I will explain. It's also possible to do numerical integration if you happen to like doing things the hard way. Gorgonzola would like me to emphasize that the normal curve has no antiderivative in terms of elementary functions, which is why every one of these routes ultimately rests on a numerical approximation.
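
For the curious, the "hard way" is not so hard with a computer. Python's standard library has math.erf, which is precisely the special function that stands in for the missing antiderivative, so the area to the left of x is a one-liner (a sketch, for comparison with the table method below):

import math

def normal_cdf(x, mu, sigma):
    # Area under N(mu, sigma) to the left of x:
    # P(X < x) = (1 + erf((x - mu) / (sigma * sqrt(2)))) / 2
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

print(normal_cdf(110, 100, 10))  # 0.8413..., the same answer the table gives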

Standardized data points (also called Z scores) disregard actual values and instead just tell how many standard deviations the point is away from the mean. You find the difference between the point you picked and the mean, then divide by the standard deviation:

     x - μ
z = -------
       σ

z is just the canonical variable used for this process. Since we're trying to find out what is below 110, we subtract the mean (110 - 100 = 10) and divide by σ (10/10 = 1). This point is +1 standard deviation away from the mean, so z = 1. If the value were negative, that would show that the point is below the mean. Now that we know that, we can use the table found at the bottom of this writeup. Look up 1.00 in the table.

Done? The table changes our z value, 1, into .8413. That is the proportion; multiply by 100 to get a percentage. So 84.13% of students scored less than 110 on the test. If we wanted to find out how many scored more than 110, just subtract the table value from 1: 1.0000 - .8413 = .1587, so 15.87% of students scored 110 or better.
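
Here is the same arithmetic in code, a sketch using Python's statistics.NormalDist (in the standard library since Python 3.8):

from statistics import NormalDist

scores = NormalDist(mu=100, sigma=10)
z = (110 - scores.mean) / scores.stdev  # z = (x - mu) / sigma = 1.0
print(NormalDist().cdf(z))              # 0.8413... from the standard normal N(0, 1)
print(scores.cdf(110))                  # the same 0.8413... straight from N(100, 10)
print(1 - scores.cdf(110))              # 0.1586..., the 15.87% who scored 110 or better

The middle two lines are the whole point of standardizing: the area below 110 under N(100, 10) equals the area below z = 1 under N(0, 1), which is why one table can cover every normal distribution.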


Un-standardizing Data

That's all well and nice, but what if you wanted to find the score that separates the top 35% of the class from the rest? With the same N(100,10) data we had before, we can do the steps backwards to get a test grade. Finding the cutoff for the top 35% of the class is the same as finding the cutoff for the bottom 65%. Look up .6500 on the INSIDE of the table and then read off the z value. Well, the table only has .6480 and .6517, so just approximate z to be about 0.385. Plug this into the formula above, z = (x - μ)/σ:

z     = (x -   μ) /  σ
0.385 = (x - 100) / 10
.385(10) = x - 100
x = 103.85

So the top 35% of the class scored 103.85 or better on the test, and the other 65% scored below 103.85.
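
The same library will do this reverse lookup directly, so you can skip reading the table inside-out (again a sketch with statistics.NormalDist):

from statistics import NormalDist

scores = NormalDist(mu=100, sigma=10)
print(scores.inv_cdf(0.65))     # 103.853..., the 65th percentile of N(100, 10)

# Or by hand, the way the table method works: find z for the 65th
# percentile of N(0, 1), then un-standardize with x = mu + z * sigma.
z = NormalDist().inv_cdf(0.65)  # 0.3853..., close to our 0.385 estimate
print(100 + z * 10)             # 103.853... again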


Standard Normal Table

To read this table, round z to two decimal places. Match the ones and tenths digits on the left with the hundredths digit along the top. The value inside the table is a probability (a percentage divided by 100). To find negative z values, look up |z| and subtract the result from 1 (if z = -0.12, then P(Z < z) = 1 - .5478 = .4522).

  z  |  .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
----------------------------------------------------------------------------
 0.0 | .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 0.1 | .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 0.2 | .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 0.3 | .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 0.4 | .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
 0.5 | .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 0.6 | .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 0.7 | .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
 0.8 | .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 0.9 | .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
     |
 1.0 | .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 1.1 | .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
 1.2 | .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
 1.3 | .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
 1.4 | .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
 1.5 | .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
 1.6 | .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
 1.7 | .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
 1.8 | .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
 1.9 | .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
     |
 2.0 | .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 2.1 | .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
 2.2 | .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
 2.3 | .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
 2.4 | .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
 2.5 | .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
 2.6 | .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
 2.7 | .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
 2.8 | .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
 2.9 | .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
     |
 3.0 | .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
 3.1 | .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
 3.2 | .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
 3.3 | .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
 3.4 | .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
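
If you would rather trust a computer than my typing, here is a Python sketch that regenerates the table above from math.erf; the entries should match to four decimal places.

import math

def phi(z):
    # Standard normal CDF: P(Z < z) = (1 + erf(z / sqrt(2))) / 2
    return (1 + math.erf(z / math.sqrt(2))) / 2

print("  z  |  " + "    ".join(f".0{c}" for c in range(10)))
for tens in range(35):  # rows 0.0 through 3.4
    row = tens / 10
    cells = "  ".join(f"{phi(row + c / 100):.4f}"[1:] for c in range(10))
    print(f" {row:3.1f} | {cells}")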

Yates, Daniel, David Moore, and George McCabe. The Practice of Statistics.