A Bayesian is someone who interprets probability as degree of belief. This interpretation contrasts with the frequentist interpretation.

Bayesian Interpretation of Probability

Almost anyone with the slightest numeracy has an understanding of probability. A statement such as "tomorrow it will rain with 60% probability" makes sense to anyone who has ever watched a weather forecast. If we ask about the meaning of this statement, we might get a response such as "well, it's not certain that tomorrow it will rain, but it's rather likely that it will." If we inquire some more, we might see that some people will carry an umbrella with them tomorrow, and some will not. Some people believe that a 60% chance of rain is high enough to warrant a preventive umbrella, and some do not.

This is a very common-sense interpretation of probability. A probability is a number between zero and one that we assign to statements of which we are uncertain, with zero meaning absolute certainty that the statement is false, one meaning absolute certainty that it is true, and varying degrees of belief in between. This is how a Bayesian interprets a probability statement. Most people, unless they have been exposed to some mathematical sophistication, hold Bayesian interpretations of probability without explicitly knowing so.

Under the Bayesian interpretation, a probability of 60% means different things to different people; some are willing to risk their money at a game of poker and some are not. Thus probabilities are subjective measurements, and this gives Bayesians their nickname: subjectivists. This subjectivity is the basis of a strong objection against the Bayesian interpretation of probability. Some people believe that probabilities are numbers that can be assigned objectively to statements about the world, that objectively there either is or isn't a good reason for playing poker, and that anyone who doesn't adhere to these reasons is simply being irrational.

But this is mere philosophy. We can discuss all day the meaning of probability until the rain soaks us wet. The mathematical treatment of probability leaves no room for interpretation. There are certain rules to follow while we perform probabilistic calculations, and they are based on three simple axioms. Mathematical abstract nonsense allows us to circumvent unpleasant discussions.

Bayes' Theorem

Let us stick to pure mathematics for a moment. Let P(A|B) denote the conditional probability that event A happens given event B has happened; by definition, P(A|B) = P(A & B)/P(B), where P(A & B) denotes the probability that both events A and B happen simultaneously. We wish to find a formula for P(B|A), e.g. if we know the probability that it rains on Tuesdays, we would like to calculate the probability that today is Tuesday given that it has rained. Straight from definitions,

               P(A & B)
    P(B|A)  = ---------,
                 P(A)

               P(A|B) P(B)
            = ------------.
                 P(A) 

This is known as Bayes' Theorem.

In most situations of interest, we do not know P(A) a priori and must calculate it from something else. Often, what we have is a partition of the sample space into mutually exclusive and exhaustive events B1, B2, ..., Bn, so that P(B1) + P(B2) + ... + P(Bn) = 1. Under such conditions, the law of total probability gives P(A) = P(A|B1) P(B1) + ... + P(A|Bn) P(Bn), and Bayes' Theorem becomes

                    P(A|B ) P(B )
                         k     k               
       P(B |A)  =  ---------------.
          k        n
                   Σ P(A|B ) P(B )
                  i=1     i     i                     
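
The discrete form is easy to try out numerically. What follows is only a minimal sketch in Python; the function name `posterior` and all the numbers are made up for illustration (each day of the week equally likely, rain with probability 0.5 on Tuesdays and 0.2 on any other day).

```python
def posterior(priors, likelihoods):
    """Return P(B_k | A) for every k, given the lists P(B_k) and P(A | B_k)."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("the priors must sum to 1 (the B_k form a partition)")
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A | B_k) P(B_k)
    evidence = sum(joint)  # P(A), by the law of total probability
    return [j / evidence for j in joint]


# Hypothetical numbers, chosen only for illustration.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
priors = [1 / 7] * 7
likelihoods = [0.5 if day == "Tue" else 0.2 for day in days]

post = posterior(priors, likelihoods)
p_tuesday_given_rain = post[days.index("Tue")]  # 0.5 / (0.5 + 6 * 0.2)
```

Under these assumed numbers, learning that it rained raises the probability that today is Tuesday from 1/7 to 5/17.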

Another case that is often interesting involves continuous random variables and their density functions. Suppose that θ is some random variable with density f(θ), and x is a vector of random variables (such as a sequence of observations!) whose joint density, given the value of θ, is g(x|θ). Then Bayes' Theorem takes the form

               g(x|θ) f(θ)
     f(θ|x) = ----------------,
               ∫ g(x|θ)f(θ) dθ
                θ

where the discrete sum has been replaced by a continuous integration taken over all possible values of θ.
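
When the integral in the denominator has no convenient closed form, it can be approximated on a grid. The sketch below is one simple way of doing so (a midpoint rule), under the assumption that θ lives on the interval (0, 1). The function name `grid_posterior` and the data (3 successes in 4 trials with a flat prior) are hypothetical.

```python
from math import comb


def grid_posterior(likelihood, prior, n_points=10_000):
    """Approximate the posterior density f(θ|x) at the midpoints of a grid on (0, 1)."""
    width = 1.0 / n_points
    thetas = [(i + 0.5) * width for i in range(n_points)]
    unnormalised = [likelihood(t) * prior(t) for t in thetas]  # g(x|θ) f(θ)
    evidence = sum(unnormalised) * width  # approximates ∫ g(x|θ) f(θ) dθ
    density = [u / evidence for u in unnormalised]
    return thetas, density, width


# Hypothetical data: 3 successes in 4 trials, with a flat prior on (0, 1).
def likelihood(theta):
    return comb(4, 3) * theta**3 * (1 - theta)


def prior(theta):
    return 1.0


thetas, density, width = grid_posterior(likelihood, prior)
posterior_mean = sum(t * d for t, d in zip(thetas, density)) * width
```

For this assumed data the exact posterior is a Beta distribution with mean 2/3, and the grid estimate agrees with it to several decimal places.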

Why is this any of our business? Because Bayesians take their name from the application of this theorem, formulated first in a paper by the late Reverend Thomas Bayes (1702-1761), published posthumously in 1763. The above development is purely mathematical and follows from the axioms and definitions of probability. Bayesians, however, have an interesting way of applying this to the world. Bayesians interpret Bayes' Theorem as a mathematical statement of how experience modifies our beliefs.

Bayesian Statistics

The Bayesian interpretation only becomes important once we start to indulge in statistical inference. The situation is the following: the world is one big complicated mess, and there are many things of which we aren't sure. Nevertheless, there are some things that we can approximate, and our beliefs about the world are among them. We come to the world with certain pre-suppositions and beliefs, and we refine and modify them as we make observations. If we once believed that on any given day it is equally likely to rain or not, we might modify our beliefs after spending a few months in the Amazon rainforest.

We therefore postulate a model of the world in which certain parameters describe the probability distributions from which we take samples. Such parameters could be the proportion of people who respond positively to a certain medication, the mean annual temperature of the Gobi desert, or the mass of a proton. These parameters are fixed, but we allow ourselves to express our uncertainty about their true values by giving them probability distributions. We assign these probabilities based on hunches or intuitive reasoning. The distribution we assign before making any observations is called a prior distribution. Then we conduct some experiments and apply Bayes' Theorem in order to modify this into a posterior distribution that reflects the new information we have. This procedure can be repeated, with our posterior as a new prior distribution, and further experimentation may yield an even better second posterior distribution.

I will present the general idea in more detail with an example. Suppose that we would like to estimate the proportion of Bayesian statisticians who are female. We will begin with a clean slate and make no assumption as to what this proportion is, and we shall quantify this un-assumption by stating that, as far as we know, the proportion of female Bayesians is a random variable that is equally likely to lie anywhere between zero and one (this is known as a uniform random variable or a uniform distribution). Let θ denote this uniform random variable. Its density function is

               /  1   if  0 < θ < 1,
     f (θ)  = {
               \  0   otherwise.

This shall be our prior distribution. In this case, it is called an uninformative prior distribution, because it does not tell us to expect any particular value of θ. Its graph is a boring straight line:


   |
 2 +
   |
   +
   |
   +
   |
   +
   |
   +
1.5+
   |
   +
   |
   +
   |
   +
   |
   +
   |
 1 *AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   |
   +
   |
   +
   |
   +
   |
   +
   |
0.5+
   |
   +
   +
   |
   +
   |
   +
   |
   +---+---+---+----+---+---+---+----+---+---+---+---+----+---+---+---+---+----+---+---+---+----+---+---+---+
 0 |
                       0.2                  0.4                  0.6                  0.8                   1

This is a very bare-bones model of the world so far, but it's about to get better. We go out amongst all our Bayesian friends (ahem, a "random sample") and count the number of X chromosomes, to the best of our abilities. Suppose there were twelve X chromosomes and eight Y chromosomes (eight boys, two girls). Let x denote the random variable "number of female Bayesians in a random sample of 10"; this is a binomial random variable with probability of success equal to θ. In our example, we observed on this particular instance that x = 2. Since x is discrete, its conditional distribution given θ is described by the probability mass function

 
                /  /10\   x      10-x
               |  (    ) θ  (1-θ)         if x = 0, 1, ..., 10
               |   \ x/
   g(x|θ)  =  {  
               |
               |     0                    otherwise
                \

In light of this new information, we shall now modify the distribution of θ. To this effect, we invoke Bayes' Theorem, which in this instance reads

                    g(x=2|θ) f(θ)
    f(θ|x=2)  = ----------------------.
                  1
                 ∫  g(x=2|θ) f(θ) dθ
                  0

A computation now ensues:

                  /10\   2       8 
                 (    ) θ   (1-θ)
                  \ 2/
    f(θ|x=2)  = ---------------------------
                  /1   /10\   2       8 
                 |    (    ) θ   (1-θ)  dθ
                / 0    \ 2/

                  
                        2      8
                       θ  (1-θ)
              = ---------------------------
                  /1    2       8 
                 |     θ   (1-θ)  dθ
                / 0   


                        2      8
                       θ  (1-θ)
              = ---------------------------  (The integral equals 1/495 because the integrand is the
                          1                   kernel of a Beta density with parameters α=3 and β=9,
                        ----                  whose normalising constant is 11!/(2! 8!) = 495.)
                         495

            
                      2      8
             =   495 θ  (1-θ).

This is the posterior distribution. We recognise it to be a Beta distribution with parameters α=3 and β=9. Its graph has a big hump around 0.2 and looks like this:



   |
   +                 AAAAAA
   +                AA     AA
   +                A       AA
   |               A          A
 3 +              A           AA
   +             A             AA
   +             A               A
   +            A                A
   +            A                 A
2.5+           A                   A
   +           A                    A
   |          AA                    A
   +          A                      A
   +          A                       A
   +         A                        AA
 2 +         A                         A
   +        A                           A
   +        A                           A
   |        A                            A
   +       A                              A
   +       A                              AA
1.5+       A                                A
   +      A                                 A
   +      A                                  A
   +      A                                   A
   +     A                                     A
   |     A                                     AA
 1 +    AA                                       A
   +    A                                         A
   +    A                                         AA
   +    A                                           AA
   +   A                                             A
0.5+   A                                              AA
   |  AA                                               AAA
   +  A                                                  AA
   + AA                                                    AAA
   + A                                                        AAA
   +A                                                            AAAAAA
   **--+---+---+----+---+---+---+----+---+---+---+---+----+---+---+---+**************************************
 0 |
                       0.2                  0.4                  0.6                  0.8                   1
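
For the sceptical, the value of the integral in the denominator can be checked exactly without knowing anything about Beta distributions. The sketch below expands (1-θ)^8 with the binomial theorem and integrates term by term using exact fractions; the helper name `beta_kernel_integral` is my own invention.

```python
from fractions import Fraction
from math import comb


def beta_kernel_integral(a, b):
    """Exact integral of θ^a (1-θ)^b over [0, 1] for non-negative integers a and b."""
    # (1-θ)^b = Σ_k C(b, k) (-θ)^k, and the integral of θ^(a+k) over [0, 1] is 1/(a+k+1).
    return sum(Fraction((-1) ** k * comb(b, k), a + k + 1) for k in range(b + 1))


integral = beta_kernel_integral(2, 8)  # the denominator of the posterior
normalising_constant = 1 / integral    # the factor in front of θ^2 (1-θ)^8
```

The result is exactly 1/495, so the posterior density is 495 θ^2 (1-θ)^8.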

Thus, even though we still are not sure of the true proportion of female Bayesians in the world, experience has taught us that we may expect about 20% of all Bayesians in the world to be female, and we can even quantify with probability statements the strength of our beliefs. The cool thing is that we can keep on modifying our distribution as we see fit. We could perform more experiments and surveys with this Beta distribution as our new prior distribution. If we work out another example, we will get another Beta distribution with different parameters. I was a bit sneaky and chose the uniform distribution because I knew it was a Beta distribution with parameters α=1 and β=1, and I knew that I would get another Beta distribution for the posterior. The mathematics doesn't always work out so nicely, unfortunately. When it does, and the prior and posterior distributions belong to the same family, we call the prior a conjugate prior for the likelihood.
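
For the Beta and binomial pairing the update rule is particularly simple: a Beta(α, β) prior combined with s successes and f failures gives a Beta(α+s, β+f) posterior. With the uniform Beta(1, 1) prior and our 2 women and 8 men, that is Beta(3, 9), whose density is proportional to θ^2 (1-θ)^8. The sketch below illustrates the rule and the repeated updating; the function name `update_beta` and the second survey (5 women among 20 more Bayesians) are hypothetical.

```python
from fractions import Fraction


def update_beta(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior plus binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures


prior = (1, 1)  # the uniform distribution is Beta(1, 1)
after_first = update_beta(*prior, successes=2, failures=8)

# Hypothetical second survey, invented for illustration: 5 women among 20 more Bayesians.
after_second = update_beta(*after_first, successes=5, failures=15)

# Updating in two steps gives the same answer as pooling all the data at once.
all_at_once = update_beta(*prior, successes=7, failures=23)

# The mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta).
posterior_mean = Fraction(after_second[0], sum(after_second))
```

That the two routes agree is what makes it legitimate to treat yesterday's posterior as today's prior.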

In the Bayesian framework, we can construct probability intervals, in analogy to the more common confidence intervals of frequentist statistics, except that now we can make true probability statements as to where the parameter will lie, because we have assigned a probability distribution to said parameter. For example, with our posterior distribution, we can correctly make a statement such as "as far as we know, the proportion of female Bayesians is between 0.1 and 0.3 with probability 59.8%".
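
The 59.8% figure is just the area under the posterior density 495 θ^2 (1-θ)^8 between 0.1 and 0.3. One way to check it, sketched below, is numerical integration with Simpson's rule; the helper names are my own, and any other quadrature method would do equally well.

```python
def posterior_density(theta):
    """Posterior density from the worked example: 495 θ^2 (1-θ)^8 on (0, 1)."""
    return 495 * theta**2 * (1 - theta) ** 8


def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


probability = simpson(posterior_density, 0.1, 0.3)  # P(0.1 < θ < 0.3 | x = 2)
total_area = simpson(posterior_density, 0.0, 1.0)   # sanity check: should be 1
```

The computed probability is roughly 0.598, in agreement with the statement above.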

Bayesian statistics begins here, with the assumption that it makes sense to quantify our beliefs by probabilities. More sophisticated techniques will rely on this basic postulate, and prior and posterior probability distributions will almost always be present in one form or another during our investigations. Some people find Bayesian statistics more intuitive and straightforward than the complicated interpretation that frequentist statistics requires. It is perhaps for these reasons that Bayesian statistics has gained popularity in recent years, although it is probably safe to say (with probability 80%) that the majority of statistics conducted nowadays is of a frequentist fashion.

Some Objections to Bayesian Interpretations

Not everyone is convinced by Bayesian statistics. For some, the base assumptions are very fishy. Probabilities are subjective measurements? Nonsense! And how are you going to choose your prior distribution? Different priors will yield different posteriors; your prejudices will forever affect the way you see the world! Not to mention that calculations are often more involved in Bayesian statistics, and complicated integrals will abound. It also seems to require more assumptions than frequentist statistics, and it is a good rule to take the simplest model of the world possible. These are all valid points, and I shall briefly address them in turn.

That probabilities are subjective measurements should not bother us, since the actual mathematical theory itself does not make any subjective judgements based upon the numbers. Bayesian statistics offers probabilities and numbers, beginning with an assumption that it makes sense to quantify belief with probability, but does not actually impose any further subjective judgements. Instead, the theory allows every individual to make the appropriate decision. As for the impact of the prior distribution, there are few situations where we are so completely ignorant as to have to assign a completely arbitrary prior distribution. Even in situations where our knowledge is very limited, we can reflect this by an uninformative uniform prior over some interval. Hopefully, the impact of our prior distribution will fade as we make more and more experiments. In fact, it can be shown that under repeated experimentation, the posterior distributions obtained from almost any reasonable priors will converge to the same distribution. The complexity of calculations should not bother us so much in this day, when computers have made numerical methods easy. We can always resort to them if needed. As for the extra assumptions required to do Bayesian statistics, I will say that yes, the Bayesian model is slightly more complicated than the frequentist, but it is thanks to this that the Bayesian model also has the ability to predict more. It is also true that sometimes nature just isn't as simple as we might hope, and a more complicated model is necessary.
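
The fading influence of the prior can be seen in a small numerical experiment. The sketch below takes two observers with different Beta priors, one flat and one strongly opinionated, and compares their posterior means as both see the same growing body of data. All the numbers are invented for illustration: the data always contain 20% successes, and the opinionated prior is Beta(8, 2).

```python
from fractions import Fraction


def posterior_mean(alpha, beta, successes, failures):
    """Mean of the Beta posterior obtained from a Beta(alpha, beta) prior and binomial data."""
    return Fraction(alpha + successes, alpha + beta + successes + failures)


flat_prior = (1, 1)         # no opinion at all
opinionated_prior = (8, 2)  # hypothetical observer who expects about 80%

gaps = []
for n in (10, 100, 1000, 10000):
    successes = n // 5  # hypothetical data: 20% successes at every sample size
    failures = n - successes
    mean_flat = posterior_mean(*flat_prior, successes, failures)
    mean_opinionated = posterior_mean(*opinionated_prior, successes, failures)
    gaps.append(float(abs(mean_flat - mean_opinionated)))
```

With ten observations the two posterior means differ by 0.25; with ten thousand they differ by less than 0.001. The prejudices are still there, but the data have drowned them out.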

Bayesian statistics are favoured in many areas of modern scientific research, particularly in biostatistics. The Bayesian model also has been used to great advantage in computer algorithms for blocking unwanted spam email, for example. I can understand why, regardless, many people would prefer to stick to the frequentist interpretation of probability and remain as objective as possible. It is important to keep extraneous assumptions to a minimum.
