Or, how I learned to stop worrying about normalization factors and love "bayesian" inference
It seems to me that there's a red herring in bayesian statistics: Bayes' theorem. In the Kolmogorovian formalism, the "Bayes formula", as I was taught, is simply a corollary to the Total Probability Theorem. First, let (A1, A2, ..., An) be a partition of the probability space Ω, that is, disjoint events whose union is Ω; then we have
(1) P(B) = Σ_i P(Ai)P(B|Ai)
which makes perfect sense since you're spanning the entire probability space with the partition. (It does have a formal proof.) Now, the "Bayes formula" as I was originally taught by the Russians is obtained by applying the definition of conditional probability to the Total Probability Theorem. That is, given that
(2) P(Ai|B) ≡ P(AiB)/P(B)
one can substitute (1) for P(B) in the denominator and get
(3) P(Ai|B) = P(AiB) / Σ_j P(Aj)P(B|Aj)
which is a way to obtain "inversely conditional" probabilities, and just fine for many applications – game theory springs to mind – but is scarcely grounds for statistical inference on its own, especially with the "conjugate priors" of the 20th century bayesian school. And while the frequentist interpretation is somewhat off, the Law of Large Numbers confers soundness on the logic of arguments based on limit frequencies.
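If you want to see (1) and (3) doing actual work, here's a minimal Python sketch on a made-up two-urn setup (the 0.4/0.6 split between urns and the red-ball probabilities are numbers I invented for illustration, nothing canonical about them):

```python
# Toy check of (1) and (3). A1 = "urn 1 was picked", A2 = "urn 2 was picked"
# (a partition of Ω); B = "a red ball was drawn". All numbers are illustrative.
p_A = {"A1": 0.4, "A2": 0.6}          # P(Ai), how likely each urn is picked
p_B_given_A = {"A1": 0.7, "A2": 0.2}  # P(B|Ai), chance of red from each urn

# (1) Total Probability Theorem: P(B) = Σ_i P(Ai) P(B|Ai)
p_B = sum(p_A[a] * p_B_given_A[a] for a in p_A)

# (3) "inverse conditional": P(Ai|B) = P(AiB) / Σ_j P(Aj) P(B|Aj),
# writing the joint P(AiB) as P(Ai)P(B|Ai) (the product rule, (4a) below).
p_A_given_B = {a: p_A[a] * p_B_given_A[a] / p_B for a in p_A}

print(p_B)          # 0.28 + 0.12 = 0.4
print(p_A_given_B)  # A1 ≈ 0.7, A2 ≈ 0.3
```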
Recently some different ideas have been impressed on me by Jaynes' probability manual. First, let's look at the definition of conditional probability from the inside out:
(4a) P(AiB) = P(Ai)P(B|Ai)
(4b) P(AiB) = P(B)P(Ai|B)
This way, Bayes' formula can be stated in a prior-posterior form. Substituting (4a) into the numerator of (3),
(5) P(Ai|B) = P(Ai)P(B|Ai) / Σ_j P(Aj)P(B|Aj)
and then applying (4a) once more to the denominator of (5), which turns it into Σ_j P(AjB), i.e. P(B) by the Total Probability Theorem, we get
(6) P(Ai|B) = P(Ai){P(B|Ai) / Σ_j P(AjB)} = P(Ai){P(B|Ai) / P(B)}
(Equivalently: just equate (4a) with (4b) and divide through by P(B).)
In plain Klingon, this is
(6k) posterior = prior × (likelihood / evidence)
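To see (6k) with numbers, here's the same made-up two-urn example from before, relabelled in the prior/likelihood/evidence vocabulary (nothing new is computed, the pieces just get their bayesian names):

```python
# Same invented two-urn numbers, relabelled as the pieces of (6).
prior      = {"A1": 0.4, "A2": 0.6}   # P(Ai)
likelihood = {"A1": 0.7, "A2": 0.2}   # P(B|Ai)

evidence = sum(prior[a] * likelihood[a] for a in prior)              # Σ_j P(AjB) = P(B)
posterior = {a: prior[a] * likelihood[a] / evidence for a in prior}  # (6k)

print(posterior)  # A1 ≈ 0.7, A2 ≈ 0.3, same answer as the urn sketch above
```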
I'm still not buying "cogent priors" other than plain uninformative ones, but this is starting to become beautiful. Now, in an amazing digression, let me take you back to the neural nets 101 course I took years ago. The first learning rule you meet for an artificial neuron is Hebbian learning, which comes down to "fire together, wire together". Makes perfect intuitive sense, doesn't it? That's a good approximation of how real neurons are calibrated, I'm told, but I'm no physician. Anyway, this is the "hebbian" version of the Bayes formula:
(6h) {wire together} = prior × {fire together}
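For comparison, here's a minimal sketch of the textbook Hebbian update, Δw = η·x·y. The three-input setup, the learning rate, and the "output unit copies input 0" wiring are all assumptions of mine, chosen just to make the co-firing visible:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01         # learning rate (a made-up value)
w = np.zeros(3)    # synaptic weights onto one output unit

for _ in range(200):
    x = rng.standard_normal(3)  # presynaptic activity
    y = x[0]                    # postsynaptic activity, driven by input 0 only
    w += eta * x * y            # Hebb: weights grow where pre and post co-fire

print(w)  # w[0] grows steadily (≈ 2); the uncorrelated weights hover near 0
```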
In actual bayesian practice the "evidence" is just a normalization constant, fixed by requiring the posterior to integrate (or sum) to 1, but (6h) is the prettiest way of stating the Bayes formula as understood by the self-anointed "bayesians". Invoking that old reverend is a distraction, and Jaynes comes epsilon-close to admitting so. This is actually about events (fire together) shaping theories (wire together).
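To make that last point concrete, here's a sketch of the practice being described: update with prior × likelihood only, ignore the "evidence" inside the loop, and normalize once at the end, since it's only a constant. The grid of hypotheses, the flat prior, and the simulated coin bias are all my own illustrative choices:

```python
import numpy as np

theta = np.linspace(0.1, 0.9, 9)         # hypotheses Ai: "heads-probability is theta_i"
unnormalized = np.full_like(theta, 1.0)  # flat prior, up to a constant

rng = np.random.default_rng(1)
for _ in range(50):                      # 50 tosses of a coin with true bias 0.3
    heads = rng.random() < 0.3
    unnormalized *= theta if heads else (1 - theta)  # posterior ∝ prior × likelihood

posterior = unnormalized / unnormalized.sum()  # the "evidence" enters only here
print(theta[posterior.argmax()])               # typically lands near the true 0.3
```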