The chi-square statistic (chi being the Greek letter) is used to test the statistical significance of a set of values, or to test for independence between two categorical variables in a table. For each value, take the difference between it and the value expected under the null hypothesis, square that difference, and divide by the expected value. Adding up all the terms yields the chi-square statistic, denoted by chi-squared. Chi-square is always >= zero. An inordinately high (or possibly an inordinately small [1]) value indicates statistical significance. Just how significant requires looking up or computing a P-value from a chi-square curve.

Example: A company accused of sex discrimination employs 630 men and 470 women. According to the null hypothesis, this difference is due to random variation, and the expected number of men and of women would each be 550. Chi-square would thus be (630 - 550)^2/550 + (470 - 550)^2/550 = 23.3. Looking this up on a chi-square curve with one degree of freedom gives a P-value far below 0.1 percent (the 0.1 percent critical value is about 10.83). This provides strong evidence to reject the null hypothesis: we can say with better than 99.9% confidence that something is causing an imbalance of men (note that we haven't shown that the company is biased against women, only that there really are too many men in the company).
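
For anyone who would rather compute than look up a table, here is a minimal Python sketch of the example above. The function names are made up for illustration. The P-value shortcut relies on the fact that, with one degree of freedom, chi-square is a squared standard normal variable, so it only works for that case.

```python
import math

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_1df(chi2):
    """Upper-tail probability, valid for ONE degree of freedom only.

    With one degree of freedom the statistic is a squared standard
    normal, so the tail area equals erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2))

# 630 men and 470 women observed; 550 of each expected under the null.
chi2 = chi_square([630, 470], [550, 550])
p = p_value_1df(chi2)
print(f"chi-square = {chi2:.2f}, P-value = {p:.1e}")
```

The P-value it reports is on the order of one in a million, well past any of the usual cutoffs.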

[1] dogboy points out that only an inordinately high value indicates significance. This is generally true, but it depends on what we are looking at. Sometimes an inordinately low number can indicate that there is too little chance error. If I have reported data for coin tosses, for instance, and there are supposedly 50,006 heads and 49,994 tails, chi-square will be about .001. This is so low that the p-value is about .97; in other words, only about 3 percent of honest experiments would match the expected counts this closely. This would be reason to suspect that the data was fabricated or massaged! This is the technique R. A. Fisher used to argue that Gregor Mendel's pea-plant results in genetics fit his (correct) theory too well to be believed. But dogboy's well-written nitpicking is generally valid, though I think we were pretty much driving at the same thing.
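
The "too good to be true" check can be sketched the same way. This is only an illustration using the coin-toss numbers from the footnote. The variable names are mine, and the erfc shortcut again assumes one degree of freedom.

```python
import math

heads, tails = 50_006, 49_994
expected = (heads + tails) / 2  # a fair coin predicts 50,000 of each

chi2 = ((heads - expected) ** 2 + (tails - expected) ** 2) / expected

# Upper tail: chance of a chi-square at least this LARGE (the usual P-value).
upper_tail = math.erfc(math.sqrt(chi2 / 2))
# Lower tail: chance of a chi-square at least this SMALL, i.e. a fit this good.
lower_tail = 1 - upper_tail

print(f"chi-square = {chi2:.5f}")
print(f"upper tail = {upper_tail:.3f}, lower tail = {lower_tail:.3f}")
```

The upper-tail P-value it computes is roughly .97, so the lower tail, which is the number that matters when you suspect fudging, is only about .03.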

Chi-square analysis is a statistical method for testing whether two dichotomous variables within a sample or population are related. The test statistic is compared against the chi-square distribution, which for one degree of freedom is the distribution of a squared standard normal variable.

For example: we may wish to determine if there is a relationship between blond hair and gender. Each variable can have only two values (i.e. it is dichotomous): blond/not blond and male/female. In this example we observe 100 people, 48 males and 52 females; 13 of the women and 10 of the men are blond. A contingency table representing these values looks like this:

          Blond     Not blond
F         13        39
M         10        38

Each column and row is then summed, in this case R1=52, R2=48, C1=23, C2=77, with n=100 the grand total. The value of each cell is f11=13, f12=39, f21=10, f22=38. The values are plugged into Eq.1 to calculate a chi-square value:

Eq.1 chisq = n(f11 f22 - f12 f21)^2 / (R1 R2 C1 C2)

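Plugging in the numbers of our example gives us a chi-square value of about 0.2447, for which p=0.62 where degrees of freedom=1. (The value 0.065967, with p=0.80, results when a continuity correction is applied, i.e. when the numerator of Eq.1 is replaced by n(abs(f11 f22 - f12 f21) - n/2)^2.) The chi-square value is analogous in use to the Z-value of the normal distribution.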

If p<=0.05 we could conclude that the presence of blond hair is related to gender. Since p>0.05, we cannot reject the null hypothesis that the two are unrelated.
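
The arithmetic for the blond-hair table can be checked with a short Python sketch. The function names and the `continuity` flag are mine, not part of any library. Eq.1 as written comes out at about 0.245; the 0.065967 figure is reproduced when the continuity correction (subtracting n/2 inside the square) is switched on. Either way p is well above 0.05.

```python
import math

def chi_square_2x2(f11, f12, f21, f22, continuity=False):
    """Eq.1 shortcut for a 2x2 table, optionally continuity-corrected."""
    r1, r2 = f11 + f12, f21 + f22   # row totals
    c1, c2 = f11 + f21, f12 + f22   # column totals
    n = r1 + r2                     # grand total
    diff = abs(f11 * f22 - f12 * f21)
    if continuity:
        # Subtract n/2 before squaring; never let the difference go negative.
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / (r1 * r2 * c1 * c2)

def p_value_1df(chi2):
    """Upper-tail probability for one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

plain = chi_square_2x2(13, 39, 10, 38)
corrected = chi_square_2x2(13, 39, 10, 38, continuity=True)
print(f"Eq.1:      chisq = {plain:.4f}, p = {p_value_1df(plain):.2f}")
print(f"corrected: chisq = {corrected:.6f}, p = {p_value_1df(corrected):.2f}")
```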

In cases where dichotomous variables have small frequencies we may use Haber's method to calculate chi-square as it is more robust than Eq.1. Eq. 2 is Haber's correction.

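Eq.2 chisq = n^3 D^2 / (R1 R2 C1 C2)

The variable D replaces part of the numerator of the calculation. D is determined by calculating f^ and d: f^ = Rmin Cmin / n and d = abs(fmin - f^), where Rmin and Cmin are the smallest row and column totals and fmin is the observed frequency in the cell where that row and column meet;

if fmin <= 2f^ then D = the largest multiple of 0.5 that is < d;
if fmin > 2f^ then D = d - 0.5.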
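
Here is one way Eq.2 might be coded up in Python, as a sketch rather than a reference implementation. It assumes that fmin means the observed count in the cell where the smallest row total and the smallest column total meet, and it clamps D at zero when d is itself zero; both choices are my reading of the rule above, and the function name is hypothetical.

```python
import math

def haber_chi_square(table):
    """Eq.2 for a 2x2 table given as [[f11, f12], [f21, f22]]."""
    rows = [sum(r) for r in table]          # R1, R2
    cols = [sum(c) for c in zip(*table)]    # C1, C2
    n = sum(rows)

    i = rows.index(min(rows))               # row with the smallest total
    j = cols.index(min(cols))               # column with the smallest total
    f_hat = rows[i] * cols[j] / n           # f^ = Rmin * Cmin / n
    f_min = table[i][j]                     # assumed meaning of fmin
    d = abs(f_min - f_hat)

    if f_min <= 2 * f_hat:
        # Largest multiple of 0.5 strictly below d (clamped at 0).
        D = max((math.ceil(2 * d) - 1) / 2, 0.0)
    else:
        D = d - 0.5

    return n ** 3 * D ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])

# Blond-hair table from the example: rows F, M; columns blond, not blond.
result = haber_chi_square([[13, 39], [10, 38]])
print(f"Haber chisq = {result:.4f}")
```

For the blond-hair table this gives f^ = 11.04, d = 1.04 and D = 1.0, so the corrected value lands a little below the uncorrected Eq.1 result.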

A chi-square test is used in statistics to determine whether or not there is an association between two categorical variables.

The most common chi-square test is Pearson's chi-square test - if you just hear the words "chi-square," 99% of the time this is what they're talking about. This test assumes:

  • a random sample;
  • a large sample size (although this is rather arbitrary);
  • all cells in a Variable X Variable table will have at least a count of 5 - if not, apply Yates' correction (more on that later);
  • similar distribution of the population; and
  • a non-directional hypothesis - that is, discovering two variables are related does not imply one causes the other or vice versa.

The chi-square test formula is

Χ^2 = Σ (Observed - Expected)^2 / Expected

Ok, if all that is confusing, now that we've covered the statistical mumbo-jumbo, let's put our knowledge to work. First, let's get some data.

Example: The Department of Transportation wants to know if more traffic accidents occur on the weekends or on the weekdays. You head down to the local DMV and get the following data:
Day of the Week     Accidents
Sunday              42 
Monday              36
Tuesday             29
Wednesday           35
Thursday            36
Friday              44
Saturday            37

So, what are we expecting here? Well, under the null hypothesis we expect each day to have the same number of accidents. To find out what each day should be, take each column (only accidents, in this case), add the values up (259), and divide by the number of elements (in this case, 7 days of the week). We get an expected value of 37.

Next, we apply our formula. I'll do the first one, Sunday, and let you guys do the rest:

(Observed - Expected)^2 / Expected

means (42 - 37)^2 / 37 =

25 / 37

Continuing on, we end up with a Χ^2 value of 144 / 37, or roughly 3.892. Next, we figure out the degrees of freedom. This is always equal to (n - 1) in chi-square tests with only one column, where n is the number of categories; here that is 7 - 1 = 6. In a multi-row, multi-column table, dF is equal to (r - 1) * (c - 1). Now, the fun part: get out your handy-dandy chi-square table. What? Don't have one? You can find them in the back of most statistics books, or, if you're lucky, your calculator or statistical analysis program will have one built in. (Update: blaaf has provided a handy-dandy table at chi-square curve.) Looking up our upper critical value in the books for 6 degrees of freedom, we see that our Χ^2 value would have to exceed 10.645 to be statistically significant at the 0.10 level (12.592 at the 0.05 level). It comes nowhere close. Therefore, we can go tell our boss at the DOT that, as far as this data shows, accidents pretty much happen at the same rate every day of the week.
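
If you'd rather let the computer do the rest of the days, the walkthrough above can be sketched in a few lines of Python. The P-value helper is my own addition, not something from the original example. It uses a closed-form expression for the chi-square tail that is only valid when the degrees of freedom are even, which happens to be the case here (dF = 6).

```python
import math

accidents = {
    "Sunday": 42, "Monday": 36, "Tuesday": 29, "Wednesday": 35,
    "Thursday": 36, "Friday": 44, "Saturday": 37,
}

observed = list(accidents.values())
expected = sum(observed) / len(observed)   # same count expected every day
chi2 = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1                     # one column, so n - 1

def p_value_even_df(chi2, df):
    """Upper-tail chi-square probability; valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut needs a positive, even df")
    half = chi2 / 2
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)

p = p_value_even_df(chi2, df)
print(f"expected = {expected}, chi-square = {chi2:.3f}, dF = {df}, p = {p:.2f}")
```

The P-value comes out around 0.69, which agrees with the table lookup: nowhere near significant.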

The Skinny

In conclusion, a chi-square test compares the observed values and the expected values to see if there's a significant difference between them. It is an excellent simple tool for checking whether two variables are related somehow.

Addendum: Yates' correction compensates for the fact that cell counts are whole numbers while the chi-square curve is continuous, a mismatch that matters most when cell counts are low. To apply the correction, if any cell in the table has a value of less than 5, subtract .5 from the absolute value of every O - E difference before squaring it and dividing by E.
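
As a rough Python sketch of the correction, assuming the usual way of getting expected counts for a two-variable table (row total times column total, divided by the grand total): the function names are made up, and the correction is applied unconditionally here just to show the mechanics, using the blond-hair table from earlier on this page.

```python
def expected_counts(table):
    """Expected cell counts if the row and column variables are unrelated."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]

def yates_chi_square(table):
    """Sum of (|O - E| - 0.5)^2 / E over every cell of the table."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            # Shrink each deviation by 0.5, but never below zero.
            adjusted = max(abs(o - e) - 0.5, 0.0)
            total += adjusted ** 2 / e
    return total

table = [[13, 39], [10, 38]]   # rows: F, M; columns: blond, not blond
corrected = yates_chi_square(table)
print(f"Yates-corrected chi-square = {corrected:.6f}")
```

Note the absolute value: subtracting .5 from a negative O - E would make the deviation bigger rather than smaller, which is the opposite of what the correction is for.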
