A Bayesian is someone who interprets probability as degree of
belief. This interpretation contrasts with the frequentist
interpretation.
Bayesian Interpretation of Probability
Almost anyone with the slightest numeracy has an understanding of
probability. A statement such as "tomorrow it will rain with 60%
probability" makes sense to anyone who has ever watched a weather
forecast. If we ask about the meaning of this statement, we might get
a response such as "well, it's not certain that tomorrow it will rain,
but it's rather likely that it will." If we inquire some more, we
might see that some people will carry an umbrella with them tomorrow,
and some might not. Some people believe that a 60% chance of rain is
a figure high enough to warrant a preventive umbrella, and some do
not.
This is a very common-sense interpretation of probability. A
probability is a number between zero and one that we assign to events
of which we are uncertain, with zero meaning certainty that some
statement is false, one meaning certainty that it is true, and varying
degrees of belief in between. This is how
a Bayesian interprets a probability statement. Most people, unless
they have been exposed to some mathematical sophistication, hold
Bayesian interpretations of probability without explicitly knowing so.
Under the Bayesian interpretation, a probability of 60% means
different things to different people; some are willing to risk
their money at a game of poker and some are not. Thus probabilities
are subjective measurements, and this gives Bayesians their nickname:
subjectivists. This subjectivity is the source of a strong objection
to the Bayesian interpretation of probability. Some people believe
that probabilities are numbers that can be assigned objectively to
statements about the world, that objectively there either is or isn't
a good reason for playing poker, and that anyone who doesn't adhere to
these reasons is simply being irrational.
But this is mere philosophy. We can discuss all day the meaning of
probability until the rain soaks us wet. The mathematical treatment
of probability leaves no room for interpretation. There are certain
rules to follow while we perform probabilistic calculations, and they
are based on three simple axioms. Mathematical abstract nonsense
allows us to circumvent unpleasant discussions.
Let us stick to pure mathematics for a moment. Let
P(A|B) denote the conditional probability
that event A happens given event B has happened; by
definition, P(A|B) = P(A
& B)/P(B), where P(A &
B) denotes the probability that both events A and
B happen simultaneously. We wish to find a formula for
P(B|A), e.g. if we know the probability
that it rains on Tuesdays, we would like to calculate the probability
that today is Tuesday given that it has rained. Straight from
definitions,
               P(A & B)     P(A|B) P(B)
   P(B|A) =   ---------- = -------------.
                 P(A)          P(A)
This is known as Bayes' Theorem.
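As a quick sanity check, the rain-and-Tuesdays example can be worked
through numerically in Python; every probability below is made up
purely for illustration:

```python
# Hypothetical inputs: the chance of rain given that it is Tuesday,
# the prior probability that a day is a Tuesday, and the overall
# chance of rain on any given day.
p_rain_given_tue = 0.3
p_tue = 1 / 7
p_rain = 0.2

# Bayes' Theorem: P(B|A) = P(A|B) P(B) / P(A).
p_tue_given_rain = p_rain_given_tue * p_tue / p_rain
print(round(p_tue_given_rain, 3))  # 0.214
```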
In most situations of interest, we do not know P(A)
a priori and must calculate it from something else. Often, what we
have is a partition of the sample space into mutually exclusive and
exhaustive events B_1, B_2, ..., B_n such that
Σ_{i=1}^n P(B_i) = 1. Under such conditions, by the
law of total probability Bayes' Theorem becomes
                    P(A|B_k) P(B_k)
   P(B_k|A) = -------------------------.
              Σ_{i=1}^n P(A|B_i) P(B_i)
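A small Python sketch shows how the denominator is assembled from the
law of total probability; the per-day rain probabilities here are
invented for illustration, with the seven days of the week as the
partition:

```python
# Hypothetical P(rain|day) for each day; the seven days partition the
# week, each with prior probability 1/7.
p_rain_given_day = {"Mon": 0.1, "Tue": 0.3, "Wed": 0.2, "Thu": 0.2,
                    "Fri": 0.1, "Sat": 0.25, "Sun": 0.15}
prior = {day: 1 / 7 for day in p_rain_given_day}

# Law of total probability: P(rain) = sum over days of P(rain|day) P(day).
p_rain = sum(p_rain_given_day[d] * prior[d] for d in prior)

# Bayes' Theorem applied to each cell of the partition.
posterior = {d: p_rain_given_day[d] * prior[d] / p_rain for d in prior}
print(round(posterior["Tue"], 3))         # 0.231: Tuesday is the rainiest day
print(round(sum(posterior.values()), 3))  # 1.0: the posteriors sum to one
```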
Another case that is often interesting involves continuous random
variables and their density functions. Suppose that θ is
some random variable with density f(θ), and
x is a vector of random variables (such as a
sequence of observations!) with joint density function
g(x). Then Bayes' Theorem takes the
form
                 g(x|θ) f(θ)
   f(θ|x) = ---------------------,
             ∫ g(x|θ) f(θ) dθ
where the discrete sum has been replaced by a continuous integration
taken over all possible values of θ.
Why is this any of our business? Because Bayesians take their
name from the application of this theorem, formulated first in a paper
by the late Reverend Thomas Bayes (1702-1761), published
posthumously in 1763. The above development is purely mathematical and
follows from the axioms and definitions of probability. Bayesians,
however, have an interesting way of applying it to the world: they
interpret Bayes' Theorem as a mathematical statement of how experience
modifies our beliefs.
Bayesian Statistics
The Bayesian interpretation only becomes important once we start to
indulge in statistical inference. The situation is the following: the
world is one big complicated mess, and there are many things of which
we aren't sure. Nevertheless, there are some things that we can
approximate, and our beliefs about the world are among them. We come
to the world with certain pre-suppositions and beliefs, and we refine
and modify them as we make observations. If we once believed that
every day it is equally likely or not to rain, we might modify our
beliefs after spending a few months in the Amazon rainforest.
We therefore postulate a model of the world in which certain
parameters describe the probability distributions from which we take
samples. Such parameters could be the proportion of people who respond
positively to a certain medication, the mean annual temperature of the
Gobi desert, or the mass of a proton. These parameters are fixed, but
we allow ourselves to express our uncertainty about their true values
by giving them probability distributions. We assign these
distributions based on hunches or intuitive reasoning. The
distribution we assign
before making any observations is called a prior
distribution. Then we conduct some experiments and apply Bayes'
Theorem in order to modify this into a posterior distribution
that reflects the new information we have. This procedure can be
repeated, with our posterior as a new prior distribution, and further
experimentation may yield an even better second posterior distribution.
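The whole prior-to-posterior recipe can be sketched in a few lines of
Python by discretising the parameter onto a grid, so that the integral
in Bayes' Theorem becomes a sum. The coin-flip-style likelihood and
the data (2 successes in 10 trials) are assumptions chosen purely for
illustration:

```python
# Discretise θ onto a grid over (0, 1) and start from a uniform prior.
n = 1000
grid = [(i + 0.5) / n for i in range(n)]
prior = [1.0] * n

def likelihood(theta, successes, failures):
    # Probability of the observed data for a given θ (binomial kernel;
    # the constant binomial coefficient cancels in Bayes' Theorem).
    return theta ** successes * (1 - theta) ** failures

# Hypothetical data: 2 successes, 8 failures.
unnorm = [likelihood(t, 2, 8) * p for t, p in zip(grid, prior)]
total = sum(u / n for u in unnorm)        # approximates the integral
posterior = [u / total for u in unnorm]   # normalised posterior density

# The posterior mean of θ summarises our updated belief.
mean = sum(t * p / n for t, p in zip(grid, posterior))
print(round(mean, 2))  # 0.25
```

The posterior computed this way could then serve as the prior for the
next batch of observations, exactly as described above.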
I will present the general idea in more detail with an
example. Suppose that we would like to estimate the proportion of
Bayesian statisticians who are female. We will begin with a clean
slate and make no assumption as to what this proportion is, and we
shall quantify this un-assumption by stating that, as far as we know,
the proportion of female Bayesians is a random variable that is
equally likely to lie anywhere between zero and one (this is known as
a uniform random variable or a uniform distribution). Let
θ denote this uniform random variable. Its density
function is
           / 1   if 0 < θ < 1,
   f(θ) = {
           \ 0   otherwise.
This shall be our prior distribution. In this case, it is called an
uninformative prior distribution, because it does not tell us
to expect any particular value of θ. Its graph is a boring
straight line:
   [Graph: a flat horizontal line at height 1 across the interval from 0 to 1.]
This is a very bare-bones model about the world so far, but it's about
to get better. We go out amongst all our Bayesian friends (ahem, a
"random sample") and count the number of X chromosomes, to the best of
our abilities. Suppose there were twelve X chromosomes and eight Y
chromosomes (eight boys, two girls). Let x
denote the random variable "number of female Bayesians in a random
sample of 10"; this is a binomial random variable with probability
of success equal to θ. In our example, we observed on this
particular instance that x = 2. The conditional probability
density function of x given θ is
            / C(10,x) θ^x (1-θ)^(10-x)   if x = 0, 1, ..., 10,
   g(x|θ) = {
            \ 0                          otherwise,

where C(10,x) denotes the binomial coefficient "10 choose x".
In light of this new information, we shall now modify the distribution
of θ. To this effect, we invoke Bayes' Theorem that in this
instance reads as
                    g(x=2|θ) f(θ)
   f(θ|x=2) = ------------------------.
               ∫_0^1 g(x=2|θ) f(θ) dθ
A computation now ensues:

                  C(10,2) θ^2 (1-θ)^8
   f(θ|x=2) = -----------------------------
               ∫_0^1 C(10,2) θ^2 (1-θ)^8 dθ

                     θ^2 (1-θ)^8
             = ----------------------
                ∫_0^1 θ^2 (1-θ)^8 dθ

             = 495 θ^2 (1-θ)^8.

(The integral in the denominator equals 1/495, which can be seen by
observing that the integrand is proportional to the density of a Beta
distribution with parameters α=3 and β=9.)
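The normalising constant 495 can be verified in Python; the
midpoint-rule integration below is just a quick numerical sketch,
cross-checked against the exact Beta function value:

```python
from math import factorial

# Midpoint rule for the integral of θ^2 (1-θ)^8 over (0, 1).
n = 100_000
integral = sum(((i + 0.5) / n) ** 2 * (1 - (i + 0.5) / n) ** 8
               for i in range(n)) / n
print(round(1 / integral))  # 495

# Exact value via the Beta function: B(3, 9) = 2! 8! / 11! = 1/495.
print(factorial(11) // (factorial(2) * factorial(8)))  # 495
```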
This is the posterior distribution. We recognise it to be a Beta
distribution with parameters α=3 and β=9. Its graph has
a big hump around 0.2 and looks like this:
   [Graph: a single hump rising from zero, peaking near θ = 0.2, and
   falling back to zero well before θ = 1.]
Thus, even though we still are not sure of the true proportion of
female Bayesians in the world, experience has taught us that we may
expect about 20% of all Bayesians in the world to be female, and we
can even quantify with probability statements the strength of our
beliefs. The cool thing is that we can keep on modifying our
distribution as we see fit. We could perform more experiments and
surveys with this Beta distribution as our new prior distribution. If
we work out another example, we will get another Beta distribution
with different parameters. I was a bit sneaky and chose the uniform distribution because I knew it was a Beta distribution with parameters α=1 and β=1, and I knew that I would get another Beta distribution for the posterior. The mathematics doesn't always work out so
nicely, unfortunately. When it does, and the prior and posterior
distributions belong to the same family, the prior is called a
conjugate prior for the likelihood.
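With conjugacy, updating reduces to bookkeeping on the parameters: a
Beta(α, β) prior combined with s successes and f failures in a
binomial experiment yields a Beta(α + s, β + f) posterior. A minimal
Python sketch (the second survey's numbers are made up):

```python
def update_beta(alpha, beta, successes, failures):
    # Beta prior + binomial data -> Beta posterior.
    return alpha + successes, beta + failures

# The uniform prior is Beta(1, 1); observe 2 females among 10 Bayesians.
a, b = update_beta(1, 1, 2, 8)
print(a, b)  # 3 9

# A hypothetical follow-up survey, 5 females among 20, updates again.
a, b = update_beta(a, b, 5, 15)
print(a, b)  # 8 24
```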
In the Bayesian framework, we can construct probability
intervals (also known as credible intervals), in analogy to the
more common confidence intervals
of frequentist statistics, except that now we can make true
probability statements as to where the parameter will lie, because we
have assigned a probability distribution to said parameter. For example, with our posterior distribution, we can correctly make a statement such as "as far as we know, the proportion of female Bayesians is between 0.1 and 0.3 with probability 59.8%".
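That 59.8% figure can be reproduced with a short Python computation.
For integer parameters a and b, the CDF of a Beta(a, b) distribution
equals a binomial tail probability, P(θ ≤ x) = P(Bin(a+b-1, x) ≥ a),
which avoids any numerical integration:

```python
from math import comb

def beta_cdf(x, a, b):
    # Beta(a, b) CDF for integer a, b via the binomial tail identity.
    n = a + b - 1
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(a, n + 1))

# P(0.1 <= θ <= 0.3) under the Beta(3, 9) posterior.
p = beta_cdf(0.3, 3, 9) - beta_cdf(0.1, 3, 9)
print(round(p, 3))  # 0.598
```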
Bayesian statistics begins here, with the assumption that it makes
sense to quantify our beliefs by probabilities. More sophisticated
techniques will rely on this basic postulate, and prior and posterior
probability distributions will almost always be present in one form or
another during our investigations. Some people find Bayesian
statistics more intuitive and straightforward than the complicated
interpretation that frequentist statistics require. It is perhaps for
these reasons that Bayesian statistics have gained popularity in
recent years, although it is probably safe to say (with probability
80%) that the majority of statistical analyses conducted nowadays are
of a frequentist fashion.
Some Objections to Bayesian Interpretations
Not everyone is convinced by Bayesian statistics. For some, the base
assumptions are very fishy. Probabilities are subjective measurements?
Nonsense! And how are you going to choose your prior distribution?
Different priors will yield different posteriors; your prejudices will
forever affect the way you see the world! Not to mention that
calculations are often more involved in Bayesian statistics, and
complicated integrals will abound. It also seems to require more
assumptions than frequentist statistics, and it is a good rule to
take the simplest model of the world possible. These are all valid
points, and I shall briefly address them in turn.
That probabilities are subjective measurements should not bother us,
since the mathematical theory itself does not make any subjective
judgements based upon the numbers. Bayesian statistics offers
probabilities, beginning with the assumption that it makes sense to
quantify belief with probability, but it does not impose any further
subjective judgements; the theory leaves every individual free to make
the appropriate decision. As for the impact of the prior distribution,
there are few situations where we are so completely ignorant as to
have to assign a completely arbitrary prior distribution. Even when
our knowledge is very limited, we can reflect this with an
uninformative uniform prior over some interval. Hopefully, the impact
of our prior distribution will fade as we make more and more
observations. In fact, it can be shown that under repeated
experimentation, the posteriors arising from almost any reasonable
priors converge to the same distribution. The complexity
of calculations should not bother us so much in this day where
computers have facilitated numerical methods. We can always resort to
them if needed. As for the extra assumptions required to do Bayesian
statistics, I will say that yes, the Bayesian model is slightly more
complicated than the frequentist, but it is thanks to this that the
Bayesian model also has the ability to predict more. It is also true
that sometimes nature just isn't as simple as we might hope, and a more
complicated model is necessary.
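The fading influence of the prior is easy to demonstrate with the
Beta-binomial machinery from earlier. In this sketch the data (a
hypothetical 300 successes in 1000 trials) swamps two very different
priors:

```python
def posterior_mean(alpha, beta, successes, failures):
    # Mean of the Beta(alpha + successes, beta + failures) posterior.
    a, b = alpha + successes, beta + failures
    return a / (a + b)

s, f = 300, 700  # hypothetical: 300 successes in 1000 trials

m1 = posterior_mean(1, 1, s, f)    # flat Beta(1, 1) prior
m2 = posterior_mean(20, 2, s, f)   # strongly biased Beta(20, 2) prior
print(round(m1, 3), round(m2, 3))  # 0.3 0.313 -- the data dominates
```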
Bayesian statistics are favoured in many areas of modern scientific
research, particularly in biostatistics. The Bayesian model also has
been used to great advantage in computer algorithms for blocking
unwanted spam email, for example. I can understand, regardless, why
many people would prefer to stick to the frequentist interpretation of
probability and remain as objective as possible. It is important to
keep extraneous assumptions to a minimum.