Have you let statistics into your heart yet? You don't know what you're missing!

I'll fix that!

What are these 'statistics' things?

The field of statistics is concerned with summarizing and drawing conclusions from data. A statistic is a number calculated from data that describes some trait of that data. Generally you would use quite a few statistics to describe data, and quite a few more to test it. Data can vary in quantity (from a few measurements to a country's census papers), and statistics can summarize it in a way that is much easier for a mere human to understand.

So what?

So, you can use statistics to see if your toe is of average size, discover how many people-of-preferred-sex-type you can expect to sleep with this week, or almost whatever else you want. And it's mostly only simple mathematics, so any monkey on a typewriter can get lai....do it!

An introduction to statistics...

Firstly I will focus on the summarization of data, seeing as it's fundamental. You've probably heard of the average height of something, or the range of scores in a sport. These are two summaries of data. In statistics, 'average' is a loose word that can mean several different things; for what you are likely to have heard, 'mean' is the more precise choice.

There are a few types of averages that we can use to describe data:

  • Mean
  • Median
  • Mode
The mean is what most people refer to as the 'average value'. To get the mean of a sample, you add the values in the sample together and divide the total by the number of values in the sample. Or, to put it mathematically: Σ(xi)/n, where xi is element 'i' of the sample, the sum runs over every element, and 'n' is the number of elements in the sample.
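
If you'd rather let a computer do the adding up, the recipe above could be sketched in Python like this. The function name and the numbers are made up purely for illustration:

```python
def mean(values):
    """Add the values together and divide by how many there are."""
    return sum(values) / len(values)


# Made-up sample, just to show the arithmetic: (2 + 4 + 6 + 8) / 4
sample = [2, 4, 6, 8]
print(mean(sample))  # 5.0
```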

The median is the middle value of the sample: half of the observations in the sample are larger than the median, and half are smaller. This is a useful tool when your sample is skewed (not symmetrical). The mean cannot resist the influence of extreme observations (i.e. it is not a resistant measurement), whereas the median can, so the median is known as a resistant measure of center. To find the median, sort the observations and count (n+1)/2 observations up from the bottom of the ordered list if n is odd. If n is even, then the median is the mean of the two middle values.
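
Here is a rough Python sketch of that odd/even rule, with a made-up sample at the end to show how one extreme value drags the mean around while the median barely notices. The function name and data are illustrative assumptions:

```python
def median(values):
    """Middle value of the sorted sample (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: this is the (n+1)/2-th value when counting from 1.
        return ordered[middle]
    # Even n: mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0

# One extreme observation: the mean jumps, the median hardly moves.
skewed = [1, 2, 3, 1000]
print(sum(skewed) / len(skewed))  # 251.5
print(median(skewed))             # 2.5
```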

The mode of a sample is the value that appears the most frequently. For example, the mode of the following data is 6:
1, 2, 3, 3, 4, 5, 6, 6, 6, 6, 7, 7, 8, 9, 9, 9, 387265325
Because six occurs the most often.
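
As a quick check on the example above, Python's standard `collections.Counter` can do the counting. This is only a sketch; a sample with a tie for most frequent value has more than one mode, and this snippet reports just one of them:

```python
from collections import Counter

data = [1, 2, 3, 3, 4, 5, 6, 6, 6, 6, 7, 7, 8, 9, 9, 9, 387265325]

# most_common(1) returns a one-item list holding the (value, count) pair
# with the highest count.
value, count = Counter(data).most_common(1)[0]
print(value, count)  # 6 4
```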

You can use the above 'averages' to measure the center of a sample, but then, in order to describe the sample properly, you'll also want to describe its spread. There are two good ways to do that: quartiles and the standard deviation.
Quartiles are an extension of the median idea. The idea is that you find the median of a set and then find the median of the top and bottom halves:

 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
             |               |              | 
             Q1             Median          Q3
 1  2  3  4  5  6  7  8  9  10 10 10 10 10 11 11 18 19 20 20 
             |               |              |
             Q1             Median          Q3
In the top set you can see that the data is symmetrical so, if you drew a graph it would look nice and perty. You'll also notice that the quartiles are labeled Q1 and Q3. You can probably guess that the median could be called Q2. The difference between the maximum and the minimum is known as the range (in the above examples: max - min = 20 - 1 = 19). And the difference between Q3 and Q1 is known as the interquartile range. Both measurements are useful for determining the spread of the data.
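
The median-of-halves idea can be sketched in Python as below, run against the two data sets from the diagrams. One assumption to flag: when n is odd, this sketch leaves the median itself out of both halves, which is one common convention among several, so other tools may give slightly different quartiles. The function name is made up:

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


top = list(range(1, 21))
bottom = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 10, 11, 11, 18, 19, 20, 20]

q1, q2, q3 = quartiles(top)
print(q1, q2, q3)           # 5.5 10.5 15.5
print(max(top) - min(top))  # range: 19
print(q3 - q1)              # interquartile range: 10.0

print(quartiles(bottom))    # (5.5, 10.0, 11.0)
```

Notice how the second set has the same range as the first but a much narrower interquartile range, because so many values are bunched up around 10.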

Standard Deviations
Another useful measurement of spread is the Standard Deviation, usually represented as s (for a sample) or σ (for a population). This is much more informative than the quartiles, but it isn't a resistant measurement, hence it is only really applicable to more symmetrical samples. The standard deviation can be thought of as a typical deviation from the center (as measured by the mean). The mathematics behind the standard deviation are a bit more complex, but still nothing too hard:
s = √( Σ(xi - xmean)² / (n-1) )
Where xi is element 'i' of the sample, xmean is the mean of the sample and 'n' is the number of observations in the sample. The reason why it's n-1 and not n comes down to the number of degrees of freedom there are in a sample. However, if you are working out the standard deviation for a whole population (not a sample from the population) then it is not 'n-1', it is just 'n'. The mathematics of that are beyond the scope of this writeup. Trust me.... But I should probably explain what those symbols are saying. You need to go through the sample, one observation at a time, subtract the mean from it and square the result. You add all of those squares together to obtain the total and then divide that by the number of observations minus 1. Then you find the square root of it.
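
Those steps translate almost word for word into Python. The sketch below uses a made-up sample and a made-up function name, and checks the hand-rolled answer against `statistics.stdev` and `statistics.pstdev` from the standard library (the n-1 and n versions respectively):

```python
from math import sqrt
from statistics import pstdev, stdev


def sample_std_dev(values):
    """Square root of the summed squared deviations divided by n - 1."""
    n = len(values)
    x_mean = sum(values) / n
    total = sum((x - x_mean) ** 2 for x in values)
    return sqrt(total / (n - 1))


# Made-up sample: mean is 5, squared deviations add up to 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(round(sample_std_dev(sample), 3))  # 2.138  (sqrt(32 / 7))
print(round(stdev(sample), 3))           # 2.138  (library, n - 1)
print(pstdev(sample))                    # 2.0    (library, n: sqrt(32 / 8))
```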

So we can come up with numbers about samples, but one of the ideas behind statistics is easy interpretation. So how do we get meaning from these numbers? Well, for each of the above methods there is an associated diagram that can assist our consumption of the data:
Boxplots are simple and yet quite informative. They serve to put the quartiles into a diagram:


 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
             +--------+--------+
    |--------|        |        |--------------------|
             +--------+--------+
    ^        ^        ^        ^                    ^
   Min:      Q1:   Median:     Q3:                  Max:
    1        4        7        10                   17

In the diagram above you can get a feel for the distribution of the data and you don't even need to read the numbers!

Bell curve
Another diagram used to visualise data is the bell curve. It's more complex than the boxplot, and not quite as intuitive, but still quite useful:

Sample statistics:
Mean: 10, Std deviation: 2.3

                /    |   \
              /------|     \
            /  |     |       \
          -/   |     |        \-
        _/     |     |          \_
               |     ^
               |    Mean = 10
              Std. dev=2.3

Yeah, not quite as simple as the boxplot, and harder to draw with text. As with the normal probability distribution, the ends never actually touch the x-axis and the area below the curve is 1. A larger standard deviation produces a fatter curve, and a smaller one a thinner curve. Of course, the curve doesn't always have to be symmetrical: if the data is skewed then you'll have a peak in some place other than at the mean. However, the mean always describes where the curve would balance.
The normal bell curve also becomes more useful the more data you have in your sample. Actually, the central limit theorem states that as n tends toward infinity, the distribution of the sample mean becomes more and more normal, whatever the shape of the population the sample came from (provided its standard deviation is finite).
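
One way to see the central limit theorem doing its thing is a small simulation. The sketch below is only an illustration under made-up assumptions: it draws many samples from a badly skewed distribution (exponential, with population mean 1 and standard deviation 1), takes the mean of each, and counts how many of those means land within 1.96 standard errors of the true mean. If the sample means are roughly normal, that should be about 95% of them.

```python
import random
from math import sqrt

random.seed(42)  # arbitrary seed, only so that repeated runs agree

N = 50         # observations per sample (made-up choice)
TRIALS = 2000  # how many samples to draw (made-up choice)

# Each entry is the mean of one sample of N skewed observations.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(N)) / N
    for _ in range(TRIALS)
]

# Standard error of the mean is sigma / sqrt(n), and sigma is 1 here.
margin = 1.96 * 1.0 / sqrt(N)
inside = sum(1 for m in sample_means if abs(m - 1.0) <= margin)
fraction_inside = inside / TRIALS

print(fraction_inside)  # should come out close to 0.95
```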

Some Applications of Statistics

One of the more useful aspects of statistics is the confidence interval. It uses properties of probability and the standard deviation to give an interval that, with a stated level of confidence, contains a value (such as the mean) of the population from which a sample is taken. This isn't immediately useful (unless that's what you're looking for), but can be used to test statistical hypotheses. An example of the kind of hypothesis that statistics can test comes further down. But first:

Confidence Intervals

Confidence intervals allow us to estimate a statistic with a known level of confidence. Why can't we get an exact statistic? Well, let's say that we want to know the mean number of M&Ms in 135g packets sold around the world. In order to get an exact value, we would need to test every packet in existence, wouldn't we? It is more practical to test, say, 100 packets and use the mean as an estimate, me, of the actual mean across the world. Through the use of confidence intervals, we can get a good idea of how close our estimate (me) is to the actual mean. Or, so that in our calculations we know that we cover the real mean, we can say that the real world mean is somewhere within the interval me±E, where E is the error. And we work this stuff out like this:
(the proof is left for other noders, this writeup isn't the place for silly mathematics) me ± z(α/2) × (σ/√n), where σ is the standard deviation of the population from which the sample is taken and z(α/2) is the z-value leaving an area of α/2 to the right (check out the normal distribution and probability). That's great, but stupid: here are some usual z values for two common confidence intervals:
95% confidence has a z-value of 1.96
99% confidence has a z-value of 2.575
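
If you need a z-value for some other confidence level, you don't have to hunt through a table. As a sketch, Python's standard `statistics.NormalDist` (Python 3.8 and later) can produce them; the helper name `z_value` is made up here. Note that the exact 99% figure rounds to 2.576, while 2.575 is the value commonly read off printed tables, and the difference is too small to matter for the examples in this writeup.

```python
from statistics import NormalDist


def z_value(confidence):
    """z-value that leaves an area of alpha/2 in the right-hand tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


print(round(z_value(0.95), 3))  # 1.96
print(round(z_value(0.99), 3))  # 2.576
```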

Note: σ represents the standard deviation of the population from which the sample is taken, not of the sample. So, that above paragraph covered the maths; how do we actually apply it? Easily! Let's say that the height of some tree after a year of growth, in centimeters, has been measured for 20 such trees:

104 105 101 103 111 110 107 99 108 108 108 109 103 98 97 100 102 102 100 112
The mean of the above sample is:

2087/20 = _104.35_
And the standard deviation, which was determined by some other team and is known to be accurate for this type of tree, is 5.64 cm. (If the standard deviation is not known for the population, then rather than using z-values, you use the t-value. The t-distribution isn't covered by this writeup though).
So, if we want to estimate the population mean, then we first need to decide how confident we want to be in our estimate. Seeing as 99% is a nice number, we'll build a 99% confidence interval using a z-value of 2.575. So, to work out the error, we put the values into the above equation: error = z(α/2) × (σ/√n) = 2.575 × (5.64/√20) ≅ 3.247
We can now say, with 99% confidence, that the mean growth of that tree in one year lies in the interval 104.35±3.247, or (101.10, 107.60). For more on confidence intervals, check out Professor Pi's writeup.
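
For the lazy, the whole tree calculation fits in a few lines of Python. This is just a sketch of the arithmetic above, using the 20 heights, the known σ of 5.64 cm and the table z-value of 2.575; the variable names are made up:

```python
from math import sqrt

heights = [104, 105, 101, 103, 111, 110, 107, 99, 108, 108,
           108, 109, 103, 98, 97, 100, 102, 102, 100, 112]

sigma = 5.64  # population standard deviation, taken as known
z = 2.575     # z-value for 99% confidence

n = len(heights)
sample_mean = sum(heights) / n
error = z * sigma / sqrt(n)

print(f"{sample_mean:.2f} ± {error:.3f}")                         # 104.35 ± 3.247
print(f"({sample_mean - error:.2f}, {sample_mean + error:.2f})")  # (101.10, 107.60)
```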

Hypothesis testing

Imagine, I know this is hard, but imagine that Coca-Cola was overfilling its bottles! Whoa!! You pay for 600mL and you're getting 603mL! Barstards - Let's sue! But in order to sue, you need evidence. Seeing as you can't test every freaking bottle that Coca-Cola makes, your job is made quite hard. But, you can use confidence intervals to test your hypothesis that those barstards are overfilling. Just before you set out to work, you call Coca-Cola to see what they have to say: "Each bottle contains a mean of 600mL of coke distributed with a standard deviation of 1.1mL" Bah! Lies! You want your evidence to be strong, the kind that will hold up in court, so you'll measure the mean of 30 bottles (drinking whilst you measure, of course) and produce a 99% confidence interval. So, let's say that the mean you get is 604.5mL, well above the quoted 600mL! So, if you construct a 99% confidence interval and 600mL isn't within that interval, then they're obviously lying and you can sue them for all they're worth!
604.5±(2.575)(1.1/√30) ≅ 604.5±.517, or (603.98, 605.02).

The quoted 600mL is nowhere near that interval. So, Coke are trying to drown you! Sue.
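
The same check can be sketched in Python: build the interval around the measured mean and see whether the claimed 600mL falls inside it. The function and variable names are made up for illustration, and this is only the simple confidence-interval version of a hypothesis test described above:

```python
from math import sqrt


def confidence_interval(sample_mean, sigma, n, z):
    """Interval sample_mean ± z * sigma / sqrt(n), returned as (low, high)."""
    error = z * sigma / sqrt(n)
    return sample_mean - error, sample_mean + error


claimed_mean = 600.0  # what the company says, in mL
low, high = confidence_interval(sample_mean=604.5, sigma=1.1, n=30, z=2.575)

print(f"({low:.2f}, {high:.2f})")   # (603.98, 605.02)
print(low <= claimed_mean <= high)  # False: 600mL is outside the interval
```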

Well, those are the most basic parts of statistics that I can think of to write. I decided against covering probability here, simply because the basics of probability would have doubled the size of this writeup; instead I'll probably write a similar one for probability. Until then, you won't know how to work out how many people-of-opposite-gender you can expect to sleep with =P

If I have missed anything important, please tell me

This writeup was brought to you by the letter σ and maths for the masses.