[Interactive simulator: prior, likelihood, and posterior Beta curves with presets, a true-bias slider (θ), prior-strength controls (α₀, β₀), running heads/tails counts, summary statistics (mean, mode, 95% CI) for the prior and posterior, and a flip-sequence evidence history.]
Bayes' theorem

Thomas Bayes' insight, published posthumously in 1763: the probability of a hypothesis given evidence is proportional to the probability of the evidence given the hypothesis, times the prior probability of the hypothesis.

P(θ | D) = P(D | θ) × P(θ) / P(D)

The denominator P(D) is just a normalizing constant — it ensures the posterior integrates to 1. So the core relationship is: posterior is proportional to likelihood times prior. This is the update rule you see above: each coin flip reshapes the gold curve.
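The update rule can be sketched numerically without any conjugacy tricks: discretize θ on a grid, multiply prior by likelihood pointwise, and normalize. This is a minimal illustration in Python (not the demo's own code), assuming a uniform prior and a chosen example of 7 heads and 3 tails:

```python
# Posterior ∝ likelihood × prior, approximated on a grid of θ values.
def grid_posterior(heads, tails, n_grid=1001):
    thetas = [i / (n_grid - 1) for i in range(n_grid)]
    prior = [1.0] * n_grid                                  # uniform prior
    likelihood = [t**heads * (1 - t)**tails for t in thetas]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnormalized)                                   # plays the role of P(D)
    return thetas, [u / z for u in unnormalized]

thetas, post = grid_posterior(heads=7, tails=3)
mean = sum(t * p for t, p in zip(thetas, post))
print(round(mean, 3))   # ≈ 0.667, the Beta(8, 4) posterior mean
```

The normalizing sum `z` is exactly the role P(D) plays: it makes the posterior a proper distribution without changing its shape.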

Conjugate priors

The Beta distribution is special for coin-flip problems. When your prior is Beta(α, β) and you observe h heads and t tails, the posterior is Beta(α + h, β + t). The prior and posterior belong to the same family — this is called conjugacy.

Conjugacy makes computation trivial: no numerical integration, no approximation. You just add counts. The parameters α and β can be interpreted as “pseudo-counts” — fictitious observations baked into your prior belief. A Beta(1, 1) prior is uniform: one pseudo-head, one pseudo-tail. A Beta(50, 50) prior acts as if you’ve already seen 100 flips of a fair coin.
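With conjugacy, the whole update collapses to two additions. A short sketch (illustrative, not the demo's implementation), again using 7 heads and 3 tails against a uniform Beta(1, 1) prior:

```python
# Conjugate Beta–Bernoulli update: just add observed counts to pseudo-counts.
def update(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + data -> Beta(alpha + heads, beta + tails)."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

a, b = update(1, 1, heads=7, tails=3)
print(a, b, round(beta_mean(a, b), 3))   # 8 4 0.667
```

Compare this with the grid approach: no integration at all, yet the posterior mean agrees.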

The prior washes out

Try the “Wrong prior” preset: the prior is strongly centered at 0.5, but the true bias is 0.8. Click “Flip 100” a few times and watch. The posterior initially resists, held in place by the strong prior. But as evidence accumulates, it migrates toward the truth.

This illustrates a fundamental result in Bayesian statistics: for any prior that puts nonzero probability density on a neighborhood of the true parameter, the posterior concentrates on the truth as data grow. The prior matters in small samples. In large samples, the data overwhelm it. Two scientists who start with very different priors will eventually agree — given enough evidence.
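The wash-out effect is easy to see in numbers. A sketch under an idealizing assumption (flips land heads at exactly the true rate of 0.8), with the same strong Beta(50, 50) prior as the preset:

```python
# How a strong prior centered at 0.5 gives way to data favoring θ = 0.8.
def posterior_mean(a0, b0, heads, tails):
    return (a0 + heads) / (a0 + b0 + heads + tails)

for n in (0, 100, 10_000):
    heads = int(0.8 * n)          # idealized: exactly 80% heads
    tails = n - heads
    print(n, round(posterior_mean(50, 50, heads, tails), 3))
    # mean climbs from 0.5 toward 0.8 as n grows
```

After 100 flips the 100 pseudo-counts still hold the mean near the middle; after 10,000 they are a rounding error.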

The base rate fallacy

Select the “Medical test” preset. A disease affects 1% of the population. A test is 99% sensitive (catches 99% of sick people) and 95% specific (correctly clears 95% of healthy people). Someone tests positive. What is the probability they have the disease?

Most people intuitively answer around 99%. The correct answer is about 17%. Why? Because for every 10,000 people, about 100 are sick and 99 of them test positive. But 9,900 are healthy, and 5% of those — 495 — test positive too. So 99 true positives are buried among 495 false positives. The base rate (1% prevalence) is the prior, and ignoring it is the base rate fallacy.
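The arithmetic above is one application of Bayes' theorem. A direct translation of the preset's numbers into code:

```python
# P(disease | positive) via Bayes' theorem, using the medical-test numbers.
prevalence  = 0.01   # P(disease) — the base rate, i.e. the prior
sensitivity = 0.99   # P(positive | disease)
specificity = 0.95   # P(negative | healthy)

# Total probability of a positive test (true positives + false positives):
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(round(p_disease_given_positive, 3))   # ≈ 0.167
```

Setting `prevalence = 0.5` in the same code yields about 95%, which shows it is the base rate, not the test's accuracy, driving the surprise.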

This is not a curiosity. It affects medical screening, criminal justice, spam filters, and any domain where a rare condition is tested for. Bayes' theorem is the corrective lens.

Bayesian vs. frequentist

A frequentist would estimate the coin’s bias as simply heads/total — the maximum likelihood estimate. No prior, no posterior, just the data. After 0 flips, the estimate is undefined. After 1 head, it’s 100%.

A Bayesian starts with a prior and updates it. After 0 flips, the estimate is the prior mean. After 1 head, it shifts toward 1 but is tempered by the prior. The Bayesian approach naturally handles small samples, quantifies uncertainty as a full distribution (not just a point estimate), and lets you incorporate prior knowledge.
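The small-sample contrast is stark in code. A sketch comparing the two estimates (the uniform Beta(1, 1) prior is an assumption chosen to match the demo's default):

```python
# Frequentist MLE vs. Bayesian posterior mean for a coin's bias.
def mle(heads, tails):
    total = heads + tails
    return heads / total if total else None   # undefined with no data

def bayes_mean(heads, tails, a0=1, b0=1):
    return (a0 + heads) / (a0 + b0 + heads + tails)

print(mle(0, 0), bayes_mean(0, 0))            # None 0.5
print(mle(1, 0), round(bayes_mean(1, 0), 3))  # 1.0 0.667
```

One head drives the MLE to certainty; the Bayesian estimate moves toward 1 but stays tempered by the prior's pseudo-counts.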

The tradeoff: the Bayesian must choose a prior, and that choice is subjective. The frequentist avoids this but loses the ability to say “there is a 95% probability the parameter lies in this interval” — frequentist confidence intervals don’t mean that. Bayesian credible intervals do.