Lecture 3: Estimation in Action — Confidence Sets, Empirical CDF, Bootstrap & MLE

Course: Core Concepts in Statistical Learning | Lecture 3 | Date: 2026-03-06

This lecture brings together the estimation machinery from Lecture 2 and shows how it works in practice. We start with a concrete worked example of asymptotic normality for the Bernoulli model, then build confidence intervals — a way to quantify our uncertainty. We then introduce the empirical CDF, the workhorse of non-parametric estimation, along with plug-in estimators and the bootstrap — a powerful computational trick for approximating quantities that are hard to compute analytically. Finally, we introduce Maximum Likelihood Estimation (MLE), the most widely used parametric estimation method.

1. Worked Example: Estimating a Coin's Bias

Let's start with a complete worked example that ties together bias, variance, MSE, and asymptotic normality from Lecture 2. Suppose we flip a coin $n$ times and want to estimate the probability of heads, $\theta$.

1.1 Setup

Let $X_1, \dots, X_n$ be i.i.d. $\sim \text{Bernoulli}(\theta)$ for $\theta \in (0,1)$. Each $X_i$ is 1 (heads) with probability $\theta$ and 0 (tails) with probability $1 - \theta$. The natural estimator is the sample mean:

$$ \hat{\theta}_n = \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i $$
Intuition

$\bar{X}_n$ is literally the fraction of heads we observed. If we flip 100 coins and see 53 heads, $\bar{X}_{100} = 0.53$. It's the most obvious estimator for "probability of heads."

1.2 Bias and Variance

We compute the expected value and variance of this estimator:

$$ E_\theta(\bar{X}_n) = \frac{1}{n}\sum_{i=1}^{n} E_\theta(X_i) = \frac{1}{n} \cdot n\theta = \theta $$

So $\text{bias}(\hat{\theta}_n) = E_\theta(\hat{\theta}_n) - \theta = 0$. The estimator is unbiased — on average, it hits the right answer.

$$ V_\theta(\bar{X}_n) = \frac{1}{n^2}\sum_{i=1}^{n} V_\theta(X_i) = \frac{1}{n^2} \cdot n\theta(1-\theta) = \frac{\theta(1-\theta)}{n} $$
Intuition

The variance shrinks like $1/n$. With 10 flips, the variance is $\theta(1-\theta)/10$. With 10,000 flips, it's $\theta(1-\theta)/10000$ — a thousand times smaller. More data = more precision. This is a recurring theme throughout statistics.

1.3 Asymptotic Normality via the CLT

Since $X_1, \dots, X_n$ are i.i.d. with mean $\theta$ and variance $\theta(1-\theta)$, the Central Limit Theorem (from Lecture 2) directly gives us:

$$ \frac{\hat{\theta}_n - \theta}{\text{se}(\hat{\theta}_n)} = \frac{\sqrt{n}(\bar{X}_n - \theta)}{\sqrt{\theta(1-\theta)}} \xrightarrow{d} N(0,1) $$

This tells us that for large $n$, the standardized estimator behaves approximately like a standard normal distribution.

Example — Numbers

Suppose $\theta = 0.3$ (the coin is biased) and $n = 400$. Then $\bar{X}_{400}$ is approximately $N(0.3, \; 0.3 \times 0.7 / 400) = N(0.3,\; 0.000525)$. The standard deviation is $\sqrt{0.000525} \approx 0.023$. So about 95% of the time, $\bar{X}_{400}$ will fall within $0.3 \pm 0.046$, i.e., between 0.254 and 0.346.
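These numbers are easy to check by simulation. Below is a minimal Python sketch (the function name, seed, and replication count are our own choices) that draws many datasets of 400 flips and compares the spread of $\bar{X}_{400}$ to the CLT prediction:

```python
import random
import statistics

def simulate_xbar(theta=0.3, n=400, reps=20_000, seed=0):
    """Draw `reps` independent copies of the sample mean of n Bernoulli(theta) flips."""
    rng = random.Random(seed)
    return [sum(rng.random() < theta for _ in range(n)) / n for _ in range(reps)]

means = simulate_xbar()
# The CLT predicts mean near 0.3 and standard deviation near
# sqrt(0.3 * 0.7 / 400), i.e. about 0.023.
print(statistics.mean(means), statistics.stdev(means))
```

With 20,000 replications the empirical standard deviation typically lands within a fraction of a percent of the theoretical 0.023.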

2. Confidence Sets (Intervals)

2.1 What Is a Confidence Interval?

Definition — Confidence Set

Fix $\alpha \in (0,1)$. A random set $C_n$ is a $(1-\alpha)$-confidence set for $\theta$ if

$$ P_\theta(\theta \in C_n) \geq 1 - \alpha \quad \text{for all } \theta \in \Theta. $$

If the inequality holds only in the limit, i.e., $\lim_{n \to \infty} P_\theta(\theta \in C_n) \geq 1 - \alpha$, then $C_n$ is an asymptotic $(1-\alpha)$-confidence set.

Intuition — What "95% confidence" really means

A 95% confidence interval does not mean "there is a 95% chance $\theta$ is in this interval." The parameter $\theta$ is a fixed (unknown) number, not random. What's random is the interval — it depends on the data we happened to observe.

The correct interpretation: if we repeated the experiment many times, each time computing a new interval from fresh data, about 95% of those intervals would contain the true $\theta$.

Example — Analogy

Imagine throwing a lasso at a fence post (the true $\theta$). The lasso is your confidence interval — its size and position depend on how you throw it (the random data). "95% confidence" means your lasso-throwing technique catches the post 95% of the time. For any single throw, either you caught it or you didn't, but your method is reliable 95% of the time.

2.2 Constructing a Confidence Interval for the Bernoulli Parameter

We showed in Section 1 that $\sqrt{n}(\bar{X}_n - \theta)/\sqrt{\theta(1-\theta)} \xrightarrow{d} N(0,1)$. The problem is that $\theta$ appears in the denominator — but $\theta$ is what we're trying to estimate! The trick is to replace $\theta$ with $\bar{X}_n$ and show that this substitution doesn't break the result.

Step 1: Replace the unknown $\theta$ in the standard error

By the Law of Large Numbers, $\bar{X}_n \xrightarrow{P} \theta$. By the Continuous Mapping Theorem (from Lecture 2), any continuous function of $\bar{X}_n$ converges in probability to that function of $\theta$:

$$ \frac{\sqrt{\theta(1-\theta)}}{\sqrt{\bar{X}_n(1-\bar{X}_n)}} \xrightarrow{P} 1 $$

Step 2: Apply Slutsky's Theorem

Slutsky's Theorem (Lecture 2) tells us: if $Z_n \xrightarrow{d} Z$ and $A_n \xrightarrow{P} a$, then $A_n Z_n \xrightarrow{d} aZ$. Applying this:

$$ \frac{\sqrt{n}(\bar{X}_n - \theta)}{\sqrt{\bar{X}_n(1-\bar{X}_n)}} = \underbrace{\frac{\sqrt{n}(\bar{X}_n - \theta)}{\sqrt{\theta(1-\theta)}}}_{\xrightarrow{d}\; N(0,1)} \cdot \underbrace{\frac{\sqrt{\theta(1-\theta)}}{\sqrt{\bar{X}_n(1-\bar{X}_n)}}}_{\xrightarrow{P}\; 1} \xrightarrow{d} N(0,1) $$
Key Insight

Slutsky's Theorem is the engine that lets us "plug in" estimates for unknown quantities in standard errors. You'll see this pattern again and again: derive an asymptotic result with the true parameter, then replace it with a consistent estimate and invoke Slutsky.

Step 3: Build the interval

Since the standardized statistic converges to $N(0,1)$, for large $n$ the probability that it falls between $-z_{1-\alpha/2}$ and $z_{1-\alpha/2}$ is approximately $1 - \alpha$:

$$ P_\theta\!\left(-z_{1-\alpha/2} \leq \frac{\sqrt{n}(\bar{X}_n - \theta)}{\sqrt{\bar{X}_n(1-\bar{X}_n)}} \leq z_{1-\alpha/2}\right) \xrightarrow{n\to\infty} 1 - \alpha $$

Rearranging for $\theta$, we get the interval:

Asymptotic $(1-\alpha)$-Confidence Interval for $\theta$ (Bernoulli)
$$ C_n = \left[\;\bar{X}_n - \frac{z_{1-\alpha/2}}{\sqrt{n}}\sqrt{\bar{X}_n(1-\bar{X}_n)},\quad \bar{X}_n + \frac{z_{1-\alpha/2}}{\sqrt{n}}\sqrt{\bar{X}_n(1-\bar{X}_n)}\;\right] $$

where $z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2)$ is the $(1-\alpha/2)$-quantile of $N(0,1)$.

Example — Computing a 95% CI

You flip a coin $n = 200$ times and observe 118 heads. Then $\bar{X}_{200} = 118/200 = 0.59$.

For a 95% CI, $\alpha = 0.05$, so $z_{0.975} \approx 1.96$. The margin of error is:

$$ \frac{1.96}{\sqrt{200}} \sqrt{0.59 \times 0.41} \approx \frac{1.96}{14.14} \times 0.492 \approx 0.0682 $$

So $C_{200} \approx [0.59 - 0.068,\; 0.59 + 0.068] = [0.522,\; 0.658]$. We are (asymptotically) 95% confident that the true probability of heads lies between 0.522 and 0.658.
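The arithmetic above is easy to automate; here is a small sketch (the function name is ours) of the Wald interval from Section 2.2:

```python
import math

def bernoulli_ci(heads, n, z=1.96):
    """Asymptotic confidence interval for a Bernoulli parameter.

    z defaults to 1.96, the 0.975-quantile of N(0,1), giving a 95% interval.
    """
    p_hat = heads / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # plug-in standard error
    return (p_hat - z * se, p_hat + z * se)

lo, hi = bernoulli_ci(118, 200)
print(round(lo, 3), round(hi, 3))  # 0.522 0.658
```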

2.3 Anatomy of the Confidence Interval

| Component | Formula | Role |
|---|---|---|
| Center | $\bar{X}_n$ | Our best guess (point estimate) |
| Standard error | $\sqrt{\bar{X}_n(1-\bar{X}_n)/n}$ | How much $\bar{X}_n$ typically varies |
| Critical value | $z_{1-\alpha/2}$ | How many SEs to go out for desired coverage |
| Width | $2 z_{1-\alpha/2} \cdot \text{SE}$ | Shrinks as $n$ grows (like $1/\sqrt{n}$) |

Intuition — Why $1/\sqrt{n}$?
Intuition — Why $1/\sqrt{n}$?

The interval width shrinks proportionally to $1/\sqrt{n}$. To cut the width in half, you need four times as much data. This is a fundamental law: precision improves with the square root of the sample size, not the sample size itself. Going from 100 to 400 observations halves the interval width; going from 100 to 10,000 shrinks it by a factor of 10.

2.4 Common Quantiles Reference

| Confidence level $1-\alpha$ | $\alpha$ | $z_{1-\alpha/2}$ |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |

3. The Empirical CDF

3.1 From Theory to Data

Recall from Lecture 1 that the CDF of a random variable $X$ is $F_X(x) = P(X \leq x) = E[\mathbf{1}_{\{X \leq x\}}]$. This tells us everything about the distribution of $X$, but we usually don't know $F_X$. How can we estimate it from data?

Definition — Empirical CDF

Given i.i.d. observations $X_1, \dots, X_n$ from some distribution $F_X$, the empirical CDF is:

$$ \hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{X_i \leq x\}} $$

In words: $\hat{F}_n(x)$ is the fraction of observations that are $\leq x$.

Intuition

The empirical CDF is the simplest possible idea: to estimate $P(X \leq x)$, just count how many of your data points are $\leq x$ and divide by $n$. No model assumptions needed — this works for any distribution.

Example — Five data points

Suppose $n = 5$ and we observe $X_1 = 2,\; X_2 = 5,\; X_3 = 1,\; X_4 = 3,\; X_5 = 5$. Then:

  • $\hat{F}_5(0) = 0/5 = 0$ (no values $\leq 0$)
  • $\hat{F}_5(1) = 1/5 = 0.2$ (just $X_3 = 1$)
  • $\hat{F}_5(2.5) = 2/5 = 0.4$ (values 1 and 2)
  • $\hat{F}_5(5) = 5/5 = 1.0$ (all values)

$\hat{F}_5$ is a staircase function that jumps by $1/5$ at each observed data point.
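The staircase is straightforward to code. A minimal sketch (function names ours) that reproduces the five-point example:

```python
def ecdf(data):
    """Return the empirical CDF: F_hat(x) = (# observations <= x) / n."""
    n = len(data)
    def F_hat(x):
        return sum(xi <= x for xi in data) / n
    return F_hat

F5 = ecdf([2, 5, 1, 3, 5])
print(F5(0), F5(1), F5(2.5), F5(5))  # 0.0 0.2 0.4 1.0
```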

3.2 Properties of the Empirical CDF

The empirical CDF has three remarkable properties:

Property 1 — Unbiasedness

For any fixed $x$: $\;E[\hat{F}_n(x)] = F_X(x)$.

Why? Each indicator $\mathbf{1}_{\{X_i \leq x\}}$ has expectation $P(X_i \leq x) = F_X(x)$. The average of $n$ things each with expectation $F_X(x)$ still has expectation $F_X(x)$.

Property 2 — Pointwise convergence (LLN + CLT)

By the Law of Large Numbers: $\;\hat{F}_n(x) \xrightarrow{P} F_X(x)$ for every $x$.

By the CLT: $\;\sqrt{n}\big(\hat{F}_n(x) - F_X(x)\big) \xrightarrow{d} N\big(0,\; F_X(x)(1 - F_X(x))\big)$.

Notice: For each fixed $x$, $\mathbf{1}_{\{X_i \leq x\}}$ is a Bernoulli random variable with parameter $p = F_X(x)$. So estimating $F_X(x)$ at a point is exactly a Bernoulli estimation problem!

Property 3 — Uniform convergence (Glivenko-Cantelli Theorem)
$$ \sup_{x \in \mathbb{R}} \left|\hat{F}_n(x) - F_X(x)\right| \xrightarrow{P} 0 $$

The empirical CDF converges to the true CDF everywhere at once, not just at a single point.

Intuition — Glivenko-Cantelli

Think of $\hat{F}_n$ as a jagged staircase and $F_X$ as a smooth curve. The Glivenko-Cantelli theorem says the staircase gets closer and closer to the smooth curve uniformly — the biggest gap between them (anywhere along the $x$-axis) shrinks to zero. This is sometimes called the "Fundamental Theorem of Statistics" because it means the empirical CDF is a universally consistent estimator of the true CDF.
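One way to see Glivenko-Cantelli numerically: for Uniform(0,1) draws the true CDF is $F(x) = x$, and the sup distance between staircase and curve is attained at the jump points. A sketch (function name and seed are our own choices):

```python
import random

def ks_distance_uniform(n, seed=0):
    """Sup-norm distance between the ECDF of n Uniform(0,1) draws and F(x) = x.

    For a step function the supremum is attained at the data points, just
    before or just after each jump.
    """
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

for n in (100, 10_000):
    print(n, ks_distance_uniform(n))  # the biggest gap shrinks as n grows
```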

4. Plug-in Estimators

4.1 The Idea

Many quantities of interest can be written as functions of the CDF $F$ — for example, the mean, the variance, and quantiles. We call such a function a statistical functional:

Definition — Statistical Functional & Plug-in Estimator

A statistical functional is a map $T$ from a set of CDFs to $\mathbb{R}^k$: $T(F)$ produces a number (or vector) that depends on the distribution $F$.

The plug-in estimator is simply $\hat{\theta}_n = T(\hat{F}_n)$ — compute $T$ using the empirical CDF instead of the true CDF.

Intuition — The Plug-in Principle

The recipe is: whatever formula you'd use if you knew the true distribution, just substitute the empirical distribution instead. Since $\hat{F}_n$ converges to $F$ (Glivenko-Cantelli), under mild conditions $T(\hat{F}_n)$ converges to $T(F)$.

4.2 Linear Functionals

If $T(F) = \int r(x)\, dF(x)$ for some fixed function $r$, then $T$ is called a linear functional. Its plug-in estimator is:

$$ T(\hat{F}_n) = \int r(x)\, d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} r(X_i) $$

This is just the sample average of $r(X_1), \dots, r(X_n)$.

4.3 Examples

Example — Moments (Linear)

The $m$-th moment is $T(F) = \int x^m \, dF(x)$, so $r(x) = x^m$. Its plug-in estimator is: $T(\hat{F}_n) = \frac{1}{n}\sum_{i=1}^{n} X_i^m$. For $m=1$, this is just the sample mean $\bar{X}_n$.

Example — Variance (Non-linear)

The variance is $T(F) = \int x^2\, dF(x) - \left(\int x\, dF(x)\right)^2$. This is a function of two linear functionals, but it is itself non-linear. The plug-in estimator is:

$$ T(\hat{F}_n) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 $$

This is the (biased) sample variance. Note: the common "unbiased" version divides by $n-1$ instead of $n$, but the plug-in principle naturally gives the $1/n$ version.
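A quick sketch of the plug-in variance (function name ours), which divides by $n$ exactly as the principle dictates:

```python
def plugin_variance(data):
    """Plug-in variance: T(F_hat) = mean of squared deviations, divided by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

data = [2, 5, 1, 3, 5]
print(plugin_variance(data))  # close to 12.8 / 5 = 2.56
```

The stdlib's `statistics.variance` would return the $n-1$ version instead (here $12.8/4 = 3.2$).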

Example — Quantiles (Non-linear)

The $q$-quantile is $T(F) = F^{-1}(q) = \inf\{x : F(x) \geq q\}$. The plug-in estimator is $\hat{F}_n^{-1}(q)$.

If the data has distinct values and we sort them as $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ (order statistics), then $\hat{F}_n^{-1}(q) = X_{(k)}$ where $k = \lceil nq \rceil$ (the smallest integer $\geq nq$).

For example, with $n = 20$ data points: the plug-in estimate of the median ($q = 0.5$) is $X_{(10)}$, the 10th smallest value. The 90th percentile ($q = 0.9$) is $X_{(18)}$, the 18th smallest.
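The order-statistic recipe in code (a sketch; names ours):

```python
import math

def plugin_quantile(data, q):
    """Plug-in q-quantile: the k-th order statistic with k = ceil(n * q)."""
    xs = sorted(data)
    k = math.ceil(len(xs) * q)
    return xs[k - 1]  # order statistics are 1-indexed in the notation

data = list(range(1, 21))  # 20 distinct values: 1, 2, ..., 20
print(plugin_quantile(data, 0.5), plugin_quantile(data, 0.9))  # 10 18
```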

5. The Bootstrap

5.1 The Problem

Suppose $\hat{\theta}_n = T(\hat{F}_n)$ is our plug-in estimator. We often want to compute something about its distribution, like its variance $V_F(\hat{\theta}_n)$ or its expected value $E_F(\hat{\theta}_n)$. But these depend on the unknown true distribution $F$ and might be analytically intractable.

Example — Why is this hard?

The sample median $\hat{\theta}_n = X_{(\lceil n/2 \rceil)}$: what is its variance? For the sample mean, we have a clean formula $V(\bar{X}_n) = \sigma^2/n$. For the median, the variance depends on the entire shape of $F$ around its center. There's no simple closed-form in general.

5.2 The Bootstrap Idea

The Bootstrap Principle

Since we can't compute $\Psi_F(\hat{\theta}_n)$ (some quantity depending on $F$), replace $F$ by $\hat{F}_n$ and compute $\Psi_{\hat{F}_n}(\hat{\theta}_n^*)$ instead. Since $\hat{F}_n$ is known (it's just the data), we can simulate from it as many times as we want.

Intuition — The World and Its Mirror

The real world works like this: Nature has an unknown distribution $F$. We get one dataset $X_1, \dots, X_n$ from $F$ and compute one estimator $\hat{\theta}_n$.

The bootstrap creates a mirror world: we pretend $\hat{F}_n$ is the true distribution. We can generate as many fake datasets from $\hat{F}_n$ as we want, compute the estimator on each, and study how it varies. The variation we see in the mirror world approximates the variation in the real world.

5.3 What Does "Sample from $\hat{F}_n$" Mean?

$\hat{F}_n$ is a discrete distribution that puts probability $1/n$ on each observed value $X_1, \dots, X_n$. So drawing $X_1^*, \dots, X_n^*$ from $\hat{F}_n$ means:

Sampling with Replacement

Pick $n$ values from $\{X_1, \dots, X_n\}$ uniformly at random with replacement. Each draw is independent, and any original observation can appear 0, 1, 2, or more times in a bootstrap sample.

Example — Bootstrap in action

Original data: $\{2, 5, 1, 3, 5\}$. Some possible bootstrap samples:

  • $\{5, 1, 5, 2, 5\}$ — the value 5 appeared 3 times, the value 3 not at all
  • $\{3, 3, 1, 2, 5\}$ — the value 3 appeared twice
  • $\{1, 1, 1, 1, 1\}$ — unlikely, but possible!

5.4 The Algorithm

To approximate $\Psi_F(\hat{\theta}_n) = E_F[h(\hat{\theta}_n)]$ for some function $h$:

  1. For $b = 1, \dots, B$ (where $B$ is large, e.g., 1000 or 10000):
    1. Draw a bootstrap sample $X_1^*, \dots, X_n^*$ by sampling with replacement from the original data.
    2. Compute the bootstrap replicate $\hat{\theta}_{n,b}^* = g(X_1^*, \dots, X_n^*)$.
  2. Approximate: $\;\Psi_{\hat{F}_n}(\hat{\theta}_n^*) \approx \frac{1}{B}\sum_{b=1}^{B} h(\hat{\theta}_{n,b}^*)$.

This works by the Law of Large Numbers: the average over the $B$ bootstrap replicates converges to the true bootstrap expectation as $B \to \infty$.

Example — Bootstrap estimate of variance

Want to estimate $V_F(\hat{\theta}_n)$? Take $h(\hat{\theta}) = (\hat{\theta} - \overline{\hat{\theta}^*})^2$, the squared deviation from the average of the bootstrap replicates. In practice:

$$ \widehat{V}_{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B} \left(\hat{\theta}_{n,b}^* - \overline{\hat{\theta}^*}\right)^2 \quad \text{where}\quad \overline{\hat{\theta}^*} = \frac{1}{B}\sum_{b=1}^{B} \hat{\theta}_{n,b}^* $$

This is just the sample variance of the bootstrap replicates.
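The whole algorithm fits in a few lines. A sketch (function names, $B$, and the seed are our own choices) that estimates the variance of the sample median, the case where no closed form exists:

```python
import random
import statistics

def bootstrap_variance(data, estimator, B=2000, seed=0):
    """Monte Carlo bootstrap: resample with replacement B times,
    recompute the estimator on each resample, and return the
    variance of the replicates (dividing by B, per the formula above)."""
    rng = random.Random(seed)
    n = len(data)
    reps = [estimator([rng.choice(data) for _ in range(n)]) for _ in range(B)]
    return statistics.pvariance(reps)

data = [2, 5, 1, 3, 5]
print(bootstrap_variance(data, statistics.median))
```

As a sanity check, applying it to the sample mean should land near the plug-in variance divided by $n$ (here $2.56/5 = 0.512$), since that quantity we can compute exactly.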

5.5 Summary Diagram

[Diagram. Real world: $F$ (unknown) $\to$ sample $X_1, \dots, X_n$ $\to$ compute $\hat{\theta}_n = g(X)$. Bootstrap world: construct $\hat{F}_n$ (known) $\to$ resample $X_1^*, \dots, X_n^*$ $\to$ compute $\hat{\theta}_{n,b}^* = g(X^*)$, repeated $B$ times $\to$ approximate $\Psi \approx \frac{1}{B}\sum_{b=1}^{B} h(\hat{\theta}_{n,b}^*)$.]

Figure 1: The bootstrap principle — mirroring the real sampling process using the empirical CDF.

6. Maximum Likelihood Estimation (MLE)

6.1 The Setup

We now return to parametric models. We have a parametric family $\mathcal{F} = \{f(x;\theta) : \theta \in \Theta\}$ and observe $X_1, \dots, X_n$ i.i.d. from $f(\cdot\,;\theta_0)$ where $\theta_0$ is the unknown true parameter.

6.2 Likelihood and Log-Likelihood

Definition — Likelihood Function

The likelihood function is the joint density of the observed data, viewed as a function of the parameter $\theta$:

$$ L_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta) $$

The log-likelihood is $\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$.

Intuition — What likelihood measures

$L_n(\theta)$ answers: "If the true parameter were $\theta$, how likely is it that we'd see exactly the data we observed?" It's not the probability that $\theta$ is the right parameter (that would be Bayesian thinking). It's the probability of the data given $\theta$.

We use the log-likelihood because products of many small numbers cause numerical issues, and sums are easier to differentiate than products.

Definition — Maximum Likelihood Estimator (MLE)

The MLE is the parameter value that maximizes the likelihood (or equivalently, the log-likelihood):

$$ \hat{\theta}_n \in \arg\max_{\theta \in \Theta}\; L_n(\theta) = \arg\max_{\theta \in \Theta}\; \ell_n(\theta) $$
Intuition — "Which θ makes my data least surprising?"

The MLE picks the parameter value under which the observed data would have been most probable. If you observed 7 heads and 3 tails, the MLE says "the coin bias that makes this data most likely is $\theta = 0.7$."

6.3 Complete Example: Bernoulli MLE

Let $X_1, \dots, X_n$ be i.i.d. $\sim \text{Bernoulli}(\theta_0)$, $\theta_0 \in (0,1)$ unknown. The density is $f(x;\theta) = \theta^x(1-\theta)^{1-x}$ for $x \in \{0,1\}$.

Step 1: Write the log-likelihood

$$ \ell_n(\theta) = \left(\sum_{i=1}^{n} X_i\right)\log\theta + \left(n - \sum_{i=1}^{n} X_i\right)\log(1-\theta) $$

Step 2: Find the stationary point

Set $\partial \ell_n/\partial\theta = 0$:

$$ \frac{\sum X_i}{\theta} - \frac{n - \sum X_i}{1-\theta} = 0 \quad \Longleftrightarrow \quad \theta = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}_n $$

Step 3: Verify it's a maximum

Since $\theta \mapsto \ell_n(\theta)$ is strictly concave on $(0,1)$ (the second derivative is negative), the stationary point is a global maximum. Therefore, $\hat{\theta}_n = \bar{X}_n$.

Example — Numbers

Flip a coin 50 times, observe 32 heads. Then $\hat{\theta}_{50} = \bar{X}_{50} = 32/50 = 0.64$. The MLE says the most likely coin bias given this data is 0.64.
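A quick numerical check that the log-likelihood really peaks at $\bar{X}_n$; the grid resolution is our own choice:

```python
import math

def bernoulli_loglik(theta, heads, n):
    """Log-likelihood l_n(theta) for `heads` successes in n Bernoulli trials."""
    return heads * math.log(theta) + (n - heads) * math.log(1 - theta)

# Grid search over (0, 1) confirms the peak sits at the sample mean 32/50 = 0.64.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: bernoulli_loglik(t, heads=32, n=50))
print(best)  # 0.64
```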

6.4 Asymptotic Normality of the MLE

Under regularity conditions, the MLE has a beautiful asymptotic property.

Definition — Score Function

The score function is the derivative of the log-density with respect to $\theta$, evaluated at $\theta_0$:

$$ s(x;\theta_0) = \frac{\partial \log f(x;\theta)}{\partial \theta}\bigg|_{\theta = \theta_0} $$

A key property: $E_{\theta_0}[s(X;\theta_0)] = 0$ (the score has mean zero).

Intuition — Why does the score have mean zero?

At the true parameter $\theta_0$, the log-likelihood is "balanced" — on average, the derivative (gradient pointing toward better $\theta$'s) is zero because we're already at the right place. If it weren't zero, there would be a systematic direction to improve, contradicting that $\theta_0$ is the truth.
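For the Bernoulli model the mean-zero property can be verified directly, since the expectation is just a two-term sum over $x \in \{0, 1\}$ (function names ours):

```python
def bernoulli_score(x, theta):
    """Score s(x; theta) = d/dtheta log f(x; theta) for the Bernoulli density."""
    return x / theta - (1 - x) / (1 - theta)

def expected_score(theta):
    # Expectation over x in {0, 1}: theta * s(1) + (1 - theta) * s(0).
    return theta * bernoulli_score(1, theta) + (1 - theta) * bernoulli_score(0, theta)

print(expected_score(0.3), expected_score(0.9))  # both ~ 0, up to rounding
```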

Definition — Fisher Information

The Fisher Information is the variance of the score:

$$ I(\theta_0) = V_{\theta_0}(s(X;\theta_0)) = E_{\theta_0}\!\left[s(X;\theta_0)^2\right] $$
Intuition — Fisher Information as "sharpness of the peak"

Fisher Information measures how sensitive the log-likelihood is to changes in $\theta$ around $\theta_0$. High $I(\theta_0)$ means the log-likelihood has a sharp peak at $\theta_0$ — the data is very informative about $\theta$ and we can estimate it precisely. Low $I(\theta_0)$ means a flat, broad peak — many $\theta$ values give similar likelihoods, so estimation is imprecise.

Theorem — Asymptotic Normality of the MLE

Under regularity conditions (smooth, well-behaved model):

$$ \sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\!\left(0,\; \frac{1}{I(\theta_0)}\right) $$

The MLE is approximately normal, centered at the truth, with variance $\frac{1}{nI(\theta_0)}$.

Why $1/I(\theta_0)$? — The Cramér-Rao Connection

The Cramér-Rao lower bound (which you may encounter later) states that no unbiased estimator can have variance smaller than $1/(nI(\theta_0))$. The MLE achieves this bound asymptotically — in this sense, it's the best possible estimator for large samples.

6.5 Bernoulli Example: Fisher Information

For $X \sim \text{Bernoulli}(\theta_0)$, the score is:

$$ s(x;\theta_0) = \frac{x}{\theta_0} - \frac{1-x}{1-\theta_0} $$

Computing the variance of the score (see derivation in the notes):

$$ I(\theta_0) = V_{\theta_0}(s(X;\theta_0)) = \frac{1}{\theta_0(1-\theta_0)} $$

Therefore, the asymptotic normality result gives:

$$ \sqrt{n}(\bar{X}_n - \theta_0) \xrightarrow{d} N(0,\; \theta_0(1-\theta_0)) $$

which matches exactly what we get from the CLT. This confirms that the MLE for the Bernoulli model is asymptotically efficient.

Example — Fisher Information for different $\theta_0$

The Fisher Information $I(\theta_0) = 1/(\theta_0(1-\theta_0))$:

| $\theta_0$ | $I(\theta_0)$ | Asymptotic variance $\theta_0(1-\theta_0)$ | Interpretation |
|---|---|---|---|
| 0.5 | 4 | 0.25 | Hardest to estimate (most uncertainty) |
| 0.1 | 11.1 | 0.09 | Easier — coin is clearly biased |
| 0.01 | 101 | 0.0099 | Very easy — almost always tails |

When $\theta_0 = 0.5$, each flip gives the least information (maximum entropy). When $\theta_0$ is near 0 or 1, outcomes are predictable, so each flip is more informative about the exact value of $\theta_0$.
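The table's entries follow from the same two-term expectation, since the score has mean zero. A sketch (names ours) confirming $I(\theta_0) = 1/(\theta_0(1-\theta_0))$:

```python
def fisher_info(theta):
    """Fisher information for Bernoulli(theta): E[s(X; theta)^2], computed
    as an exact two-term sum over x in {0, 1}."""
    s1 = 1 / theta          # score at x = 1
    s0 = -1 / (1 - theta)   # score at x = 0
    return theta * s1 ** 2 + (1 - theta) * s0 ** 2

for theta in (0.5, 0.1, 0.01):
    print(theta, fisher_info(theta), 1 / (theta * (1 - theta)))
```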

7. Connecting the Pieces

Here's a bird's-eye view of the estimation toolkit from this lecture:

| Method | Assumes a parametric model? | Key idea | Best for |
|---|---|---|---|
| Plug-in via $\hat{F}_n$ | No | Replace $F$ with $\hat{F}_n$ | Moments, quantiles, any CDF-based quantity |
| Bootstrap | No (can also be parametric) | Simulate from $\hat{F}_n$ to approximate distributional properties | Variance, confidence intervals, complex functionals |
| MLE | Yes | Maximize the likelihood $\prod_i f(X_i;\theta)$ | Parametric models — optimal for large $n$ |
| CLT-based CI | Depends | Use asymptotic normality + Slutsky | Any setting where the CLT applies |

8. Key Takeaways