Chi-square analysis is a statistical method for testing whether two dichotomous variables within a sample or population are related. The test statistic is compared against the chi-square distribution, which with one degree of freedom is the distribution of a squared standard normal value.

For example, we may wish to determine whether there is a relationship between blond hair and gender. Each variable can take only two values (i.e. it is dichotomous): blond/not blond and male/female. In this example we observe 100 people, 48 males and 52 females; 13 of the women and 10 of the men are blond. These values are arranged in a contingency table that looks like this:

        Blond   Not blond
F        13        39
M        10        38

Each column and row is then summed, in this case R1=52, R2=48, C1=23, C2=77, and n=100 is the total number of observations. The value of each cell is f11=13, f12=39, f21=10, f22=38. The values are plugged into Eq.1 to calculate a Chi-square value:


Eq.1 chisq = n (f11 f22 - f12 f21)^2 / (R1 R2 C1 C2)

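Plugging the numbers of our example into Eq.1 as written gives a Chi-square value of 0.2447, for which p=0.62 at degrees of freedom=1. The value 0.065967, with its associated p=0.80, results when Yates' continuity correction is applied, i.e. when n/2 is subtracted from |f11 f22 - f12 f21| before squaring. The Chi-square value is analogous in use to the Z-value of the normal distribution.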

If p<=0.05 we could conclude that the presence of blond hair is related to gender. Since p>0.05, we cannot reject the null hypothesis that hair color and gender are unrelated.
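The arithmetic of this example can be checked with a short Python sketch. The function names and the `yates` flag are illustrative choices, not part of the writeup. The flag applies the usual continuity correction (subtracting n/2 from the absolute cross-product difference), and the p-value uses the fact that a chi-square with one degree of freedom is a squared standard normal. On the table above, Eq.1 without the correction comes out near 0.2447 and the corrected form near 0.065967.

```python
import math

def chi_square_2x2(f11, f12, f21, f22, yates=False):
    """Chi-square for a 2x2 table using the shortcut formula of Eq.1.

    yates=True subtracts n/2 from |f11*f22 - f12*f21| before squaring
    (continuity correction; this flag is an addition for illustration).
    """
    n = f11 + f12 + f21 + f22
    r1, r2 = f11 + f12, f21 + f22   # row totals
    c1, c2 = f11 + f21, f12 + f22   # column totals
    diff = abs(f11 * f22 - f12 * f21)
    if yates:
        diff = max(diff - n / 2, 0.0)  # never let the correction go negative
    return n * diff ** 2 / (r1 * r2 * c1 * c2)

def p_value_df1(chisq):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chisq / 2))

plain = chi_square_2x2(13, 39, 10, 38)
corrected = chi_square_2x2(13, 39, 10, 38, yates=True)
print("Eq.1:      chisq=%.4f  p=%.2f" % (plain, p_value_df1(plain)))
print("corrected: chisq=%.6f  p=%.2f" % (corrected, p_value_df1(corrected)))
```

Either way p is far above 0.05, so the conclusion of the example does not depend on which form is used.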

In cases where dichotomous variables have small frequencies we may use Haber's method to calculate chi-square as it is more robust than Eq.1. Eq. 2 is Haber's correction.


Eq.2 chisq = n^3 D^2 / (R1 R2 C1 C2)

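The variable D replaces part of the numerator of the calculation. D is determined by calculating f^ and d: f^ = Rminimum Cminimum / n, where Rminimum and Cminimum are the smallest row total and the smallest column total, and d = abs(fminimum - f^), where fminimum is the observed frequency of the cell in that row and column (the cell with the smallest expected frequency, f^).

If fminimum <= 2f^ then D = the largest multiple of 0.5 that is < d;
if fminimum > 2f^ then D = d - 0.5.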
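The rule for D can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. It assumes integer cell counts and nonzero row and column totals, it takes the observed frequency from the cell where the smallest row total and smallest column total meet, and it clamps D at zero when d is zero (a safeguard added here). The function name is hypothetical. Exact fractions are used so that "largest multiple of 0.5 that is < d" is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction

def haber_chi_square(f11, f12, f21, f22):
    """Chi-square for a 2x2 table with Haber's correction (Eq.2)."""
    table = [[f11, f12], [f21, f22]]
    rows = [f11 + f12, f21 + f22]          # R1, R2
    cols = [f11 + f21, f12 + f22]          # C1, C2
    n = sum(rows)
    i = rows.index(min(rows))              # row with the smallest total
    j = cols.index(min(cols))              # column with the smallest total
    f_hat = Fraction(rows[i] * cols[j], n) # smallest expected frequency
    f = table[i][j]                        # observed frequency in that cell
    d = abs(f - f_hat)
    if f <= 2 * f_hat:
        # largest multiple of 0.5 strictly below d (clamped at 0: assumption)
        D = Fraction(max(math.ceil(2 * d) - 1, 0), 2)
    else:
        D = d - Fraction(1, 2)
    return float(n ** 3 * D ** 2 / (rows[0] * rows[1] * cols[0] * cols[1]))

haber = haber_chi_square(13, 39, 10, 38)
print("Haber-corrected chisq = %.4f" % haber)
```

For the blond hair table, f^ = 48 x 23 / 100 = 11.04, the observed count in that cell is 10, so d = 1.04 and D = 1.0.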

A chi-square test is used in statistics to determine whether or not there is an association between two categorical variables.

The most common chi-square test is Pearson's chi-square test - if you just hear the words "chi-square," 99% of the time this is what they're talking about. This test assumes:

  • a random sample;
  • a large sample size (although this is rather arbitrary);
  • all cells in a Variable X Variable table will have an expected count of at least 5 - if not, apply Yates' correction (more on that later);
  • similar distribution of the population; and
  • a non-directional hypothesis - that is, discovering two variables are related does not imply one causes the other or vice versa.

The chi-square test formula is

Χ^2 = Σ (Observed - Expected)^2 / Expected

Ok, now that we've covered the statistical mumbo-jumbo, let's put our knowledge to work. First, let's get some data.

Example: The Department of Transportation wants to know if more traffic accidents occur on the weekends or on the weekdays. You head down to the local DMV and get the following data:
Day of the Week     Accidents
Sunday              42 
Monday              36
Tuesday             29
Wednesday           35
Thursday            36
Friday              44
Saturday            37

So, what are we expecting here? Well, if the day of the week makes no difference, we expect each day to have the same number of accidents. To find out what each day should be, take each column (only accidents, in this case), add the values up, and divide the total by the number of elements (in this case, 7 days of the week). The total is 259, so we get an expected value of 259 / 7 = 37.

Next, we apply our formula. I'll do the first one, Sunday, and let you guys do the rest:

(Observed - Expected)^2 / Expected

means (42 - 37)^2 / 37 =

25 / 37

Continuing on, we end up with a Χ^2 value of 144 / 37, or roughly 3.892. Next, we figure out the degrees of freedom. In a chi-square test with only one column this is always equal to (n - 1), where n is the number of categories; here that is 7 - 1 = 6. In a multi-row, multi-column table, dF is equal to (r - 1) * (c - 1). Now, the fun part: get out your handy-dandy chi-square table. What? Don't have one? You can find them in the back of most statistics books, or, if you're lucky, your calculator or statistical analysis program will have one built in. (Update: blaaf has provided a handy-dandy table at chi-square curve.) Looking up the upper critical value for 6 degrees of freedom, we see that our Χ^2 value would have to exceed 10.645 to be statistically significant at the 0.10 level (12.592 at the 0.05 level). Ours falls well short of both. Therefore, we can safely go tell our boss at the DOT that, as far as these data can tell, accidents pretty much happen at the same rate every day of the week.
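If you'd rather not do the remaining six days by hand, here is a rough Python sketch of the whole calculation. The variable and function names are made up for illustration. The p-value helper uses the closed-form upper tail of the chi-square distribution, which only works for an even number of degrees of freedom (fine here, since dF = 6); for anything else you would want a statistics package.

```python
import math

accidents = {
    "Sunday": 42, "Monday": 36, "Tuesday": 29, "Wednesday": 35,
    "Thursday": 36, "Friday": 44, "Saturday": 37,
}

observed = list(accidents.values())
expected = sum(observed) / len(observed)   # same expected count for every day

# Sum of (Observed - Expected)^2 / Expected over all seven days
chisq = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1                     # one column: categories minus one

def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x) for even df only (closed-form series)."""
    if df % 2 != 0:
        raise ValueError("this shortcut only handles even degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

p_value = chi2_upper_tail_even_df(chisq, df)
print("expected per day = %.0f" % expected)
print("chisq = %.3f, dF = %d, p = %.2f" % (chisq, df, p_value))
```

The p-value comes out around 0.69, which tells the same story as the table lookup: nowhere near significant.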

The Skinny

In conclusion, a chi-square test compares the observed values with the expected values to see whether they differ by more than chance alone would explain. It is an excellent, simple tool for checking whether two variables are related somehow.

Addenda: Yates' correction compensates for the fact that cell counts are whole numbers while the chi-square distribution is continuous, which matters most when cell counts are low. To apply the correction, if any cell in the table has a value of less than 5, subtract .5 from the absolute value of every O - E before squaring it and dividing by E.
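Here is a small Python sketch of the correction applied to a Variable X Variable table, using the Observed/Expected form of the formula. Expected counts are taken as row total times column total divided by the grand total, which is the standard choice for a contingency table. The function name, the `yates` flag, and the clamp at zero are illustrative assumptions. The data is the blond hair table from the first writeup on this page; no cell there is below 5, so it only serves to show the mechanics.

```python
def chi_square_table(observed, yates=False):
    """Pearson chi-square for a table given as a list of rows.

    yates=True subtracts 0.5 from each |O - E| before squaring
    (clamped at zero so a tiny difference cannot turn negative).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chisq = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)
            chisq += diff ** 2 / e
    return chisq

blond_table = [[13, 39],   # F: blond, not blond
               [10, 38]]   # M: blond, not blond
uncorrected = chi_square_table(blond_table)
with_yates = chi_square_table(blond_table, yates=True)
print("uncorrected = %.4f, with Yates = %.6f" % (uncorrected, with_yates))
```

The corrected value is always the smaller of the two, so the correction makes it harder, never easier, to call a result significant.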
