This lecture brings together the estimation machinery from Lecture 2 and shows how it works in practice. We start with a concrete worked example of asymptotic normality for the Bernoulli model, then build confidence intervals — a way to quantify our uncertainty. We then introduce the empirical CDF, the workhorse of non-parametric estimation, along with plug-in estimators and the bootstrap — a powerful computational trick for approximating quantities that are hard to compute analytically. Finally, we introduce Maximum Likelihood Estimation (MLE), the most widely used parametric estimation method.
1. Worked Example: Estimating a Coin's Bias
Let's start with a complete worked example that ties together bias, variance, MSE, and asymptotic normality from Lecture 2. Suppose we flip a coin $n$ times and want to estimate the probability of heads, $\theta$.
1.1 Setup
Let $X_1, \dots, X_n$ be i.i.d. $\sim \text{Bernoulli}(\theta)$ for $\theta \in (0,1)$. Each $X_i$ is 1 (heads) with probability $\theta$ and 0 (tails) with probability $1 - \theta$. The natural estimator is the sample mean:
$$\hat{\theta}_n = \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
$\bar{X}_n$ is literally the fraction of heads we observed. If we flip 100 coins and see 53 heads, $\bar{X}_{100} = 0.53$. It's the most obvious estimator for "probability of heads."
1.2 Bias and Variance
We compute the expected value and variance of this estimator:
$$E_\theta(\hat{\theta}_n) = \frac{1}{n}\sum_{i=1}^{n} E_\theta(X_i) = \theta, \qquad V_\theta(\hat{\theta}_n) = \frac{1}{n^2}\sum_{i=1}^{n} V_\theta(X_i) = \frac{\theta(1-\theta)}{n}.$$
So $\text{bias}(\hat{\theta}_n) = E_\theta(\hat{\theta}_n) - \theta = 0$. The estimator is unbiased — on average, it hits the right answer.
The variance shrinks like $1/n$. With 10 flips, the variance is $\theta(1-\theta)/10$. With 10,000 flips, it's $\theta(1-\theta)/10000$ — a thousand times smaller. More data = more precision. This is a recurring theme throughout statistics.
1.3 Asymptotic Normality via the CLT
Since $X_1, \dots, X_n$ are i.i.d. with mean $\theta$ and variance $\theta(1-\theta)$, the Central Limit Theorem (from Lecture 2) directly gives us:
$$\sqrt{n}\,\frac{\bar{X}_n - \theta}{\sqrt{\theta(1-\theta)}} \xrightarrow{d} N(0,1).$$
This tells us that for large $n$, the standardized estimator behaves approximately like a standard normal distribution.
Suppose $\theta = 0.3$ (the coin is biased) and $n = 400$. Then $\bar{X}_{400}$ is approximately $N(0.3, \; 0.3 \times 0.7 / 400) = N(0.3,\; 0.000525)$. The standard deviation is $\sqrt{0.000525} \approx 0.023$. So about 95% of the time, $\bar{X}_{400}$ will fall within $0.3 \pm 0.046$, i.e., between 0.254 and 0.346.
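This prediction is easy to check by simulation. The following sketch (the replication count and seed are arbitrary choices, not from the lecture) draws many datasets of 400 flips with $\theta = 0.3$ and compares the empirical behavior of $\bar{X}_{400}$ to the CLT approximation $N(0.3,\; 0.000525)$:

```python
import random

# Simulate the sampling distribution of the sample mean of 400 Bernoulli(0.3)
# flips and compare it to the CLT prediction N(0.3, 0.000525).
random.seed(0)
theta, n, reps = 0.3, 400, 10_000

means = []
for _ in range(reps):
    flips = [1 if random.random() < theta else 0 for _ in range(n)]
    means.append(sum(flips) / n)

# Empirical mean and variance of the estimator across replications
emp_mean = sum(means) / reps
emp_var = sum((m - emp_mean) ** 2 for m in means) / reps

print(f"empirical mean ≈ {emp_mean:.4f}   (theory: 0.3)")
print(f"empirical var  ≈ {emp_var:.6f} (theory: 0.000525)")

# Fraction of replications landing in 0.3 ± 0.046 (about 2 standard deviations)
inside = sum(0.254 <= m <= 0.346 for m in means) / reps
print(f"fraction within ±2 SD ≈ {inside:.3f} (theory: ≈ 0.95)")
```

The empirical variance and the two-standard-deviation coverage should land close to the theoretical values, illustrating how accurate the normal approximation already is at $n = 400$.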
2. Confidence Sets (Intervals)
2.1 What Is a Confidence Interval?
Fix $\alpha \in (0,1)$. A random set $C_n$ is a $(1-\alpha)$-confidence set for $\theta$ if
$$P_\theta(\theta \in C_n) \geq 1 - \alpha \quad \text{for all } \theta \in \Theta.$$
If the inequality holds only in the limit, i.e., $\lim_{n \to \infty} P_\theta(\theta \in C_n) \geq 1 - \alpha$, then $C_n$ is an asymptotic $(1-\alpha)$-confidence set.
A 95% confidence interval does not mean "there is a 95% chance $\theta$ is in this interval." The parameter $\theta$ is a fixed (unknown) number, not random. What's random is the interval — it depends on the data we happened to observe.
The correct interpretation: if we repeated the experiment many times, each time computing a new interval from fresh data, about 95% of those intervals would contain the true $\theta$.
Imagine throwing a lasso at a fence post (the true $\theta$). The lasso is your confidence interval — its size and position depend on how you throw it (the random data). "95% confidence" means your lasso-throwing technique catches the post 95% of the time. For any single throw, either you caught it or you didn't, but your method is reliable 95% of the time.
2.2 Constructing a Confidence Interval for the Bernoulli Parameter
We showed in Section 1 that $\sqrt{n}(\bar{X}_n - \theta)/\sqrt{\theta(1-\theta)} \xrightarrow{d} N(0,1)$. The problem is that $\theta$ appears in the denominator — but $\theta$ is what we're trying to estimate! The trick is to replace $\theta$ with $\bar{X}_n$ and show that this substitution doesn't break the result.
Step 1: Replace the unknown $\theta$ in the standard error
By the Law of Large Numbers, $\bar{X}_n \xrightarrow{P} \theta$. By the Continuous Mapping Theorem (from Lecture 2), any continuous function of $\bar{X}_n$ converges in probability to that function of $\theta$:
$$\sqrt{\bar{X}_n(1-\bar{X}_n)} \xrightarrow{P} \sqrt{\theta(1-\theta)}.$$
Step 2: Apply Slutsky's Theorem
Slutsky's Theorem (Lecture 2) tells us: if $Z_n \xrightarrow{d} Z$ and $A_n \xrightarrow{P} a$, then $A_n Z_n \xrightarrow{d} aZ$. Applying this with $Z_n = \sqrt{n}(\bar{X}_n - \theta)/\sqrt{\theta(1-\theta)}$ and $A_n = \sqrt{\theta(1-\theta)}/\sqrt{\bar{X}_n(1-\bar{X}_n)} \xrightarrow{P} 1$:
$$\sqrt{n}\,\frac{\bar{X}_n - \theta}{\sqrt{\bar{X}_n(1-\bar{X}_n)}} \xrightarrow{d} N(0,1).$$
Slutsky's Theorem is the engine that lets us "plug in" estimates for unknown quantities in standard errors. You'll see this pattern again and again: derive an asymptotic result with the true parameter, then replace it with a consistent estimate and invoke Slutsky.
Step 3: Build the interval
Since the standardized statistic converges to $N(0,1)$, for large $n$ the probability that it falls between $-z_{1-\alpha/2}$ and $z_{1-\alpha/2}$ is approximately $1 - \alpha$:
$$P_\theta\left(-z_{1-\alpha/2} \leq \sqrt{n}\,\frac{\bar{X}_n - \theta}{\sqrt{\bar{X}_n(1-\bar{X}_n)}} \leq z_{1-\alpha/2}\right) \approx 1 - \alpha.$$
Rearranging for $\theta$, we get the interval:
$$C_n = \left[\bar{X}_n - z_{1-\alpha/2}\sqrt{\frac{\bar{X}_n(1-\bar{X}_n)}{n}},\;\; \bar{X}_n + z_{1-\alpha/2}\sqrt{\frac{\bar{X}_n(1-\bar{X}_n)}{n}}\right],$$
where $z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2)$ is the $(1-\alpha/2)$-quantile of $N(0,1)$.
You flip a coin $n = 200$ times and observe 118 heads. Then $\bar{X}_{200} = 118/200 = 0.59$.
For a 95% CI, $\alpha = 0.05$, so $z_{0.975} \approx 1.96$. The margin of error is:
$$z_{0.975}\sqrt{\frac{\bar{X}_{200}(1-\bar{X}_{200})}{200}} = 1.96\sqrt{\frac{0.59 \times 0.41}{200}} \approx 0.068.$$
So $C_{200} \approx [0.59 - 0.068,\; 0.59 + 0.068] = [0.522,\; 0.658]$. We are (asymptotically) 95% confident that the true probability of heads lies between 0.522 and 0.658.
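The interval construction fits in a few lines of code. A minimal sketch (the function name is mine, not from the lecture) that reproduces the 118-heads-in-200-flips example:

```python
import math

def bernoulli_ci(heads, n, z=1.96):
    """Asymptotic (Slutsky-based) confidence interval for a Bernoulli parameter."""
    xbar = heads / n
    se = math.sqrt(xbar * (1 - xbar) / n)   # plug-in standard error
    return xbar - z * se, xbar + z * se

lo, hi = bernoulli_ci(118, 200)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")   # ≈ [0.522, 0.658]
```

Passing a different critical value (e.g. `z=2.576` from the quantile table below) widens the interval to the 99% level.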
2.3 Anatomy of the Confidence Interval
| Component | Formula | Role |
|---|---|---|
| Center | $\bar{X}_n$ | Our best guess (point estimate) |
| Standard error | $\sqrt{\bar{X}_n(1-\bar{X}_n)/n}$ | How much $\bar{X}_n$ typically varies |
| Critical value | $z_{1-\alpha/2}$ | How many SEs to go out for desired coverage |
| Width | $2 z_{1-\alpha/2} \cdot \text{SE}$ | Shrinks as $n$ grows (like $1/\sqrt{n}$) |
The interval width shrinks proportionally to $1/\sqrt{n}$. To cut the width in half, you need four times as much data. This is a fundamental law: precision improves with the square root of the sample size, not the sample size itself. Going from 100 to 400 observations halves the interval width; going from 100 to 10,000 shrinks it by a factor of 10.
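Both the coverage guarantee and the lasso interpretation from Section 2.1 can be checked by simulation. This sketch (the true $\theta$, sample size, and replication count are illustrative choices) rebuilds the interval from fresh data many times and counts how often it catches the fixed true parameter:

```python
import math
import random

# Coverage check: repeat the experiment many times, build a 95% CI from each
# dataset, and count how often the fixed true θ is caught.
random.seed(1)
theta, n, reps, z = 0.4, 500, 2000, 1.96

caught = 0
for _ in range(reps):
    xbar = sum(1 if random.random() < theta else 0 for _ in range(n)) / n
    se = math.sqrt(xbar * (1 - xbar) / n)
    if xbar - z * se <= theta <= xbar + z * se:
        caught += 1

print(f"coverage ≈ {caught / reps:.3f}   (nominal: 0.95)")
```

Any single interval either contains $\theta$ or it doesn't; it is the long-run fraction that sits near 95%.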
2.4 Common Quantiles Reference
| Confidence level $1-\alpha$ | $\alpha$ | $z_{1-\alpha/2}$ |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |
3. The Empirical CDF
3.1 From Theory to Data
Recall from Lecture 1 that the CDF of a random variable $X$ is $F_X(x) = P(X \leq x) = E[\mathbf{1}_{\{X \leq x\}}]$. This tells us everything about the distribution of $X$, but we usually don't know $F_X$. How can we estimate it from data?
Given i.i.d. observations $X_1, \dots, X_n$ from some distribution $F_X$, the empirical CDF is:
$$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{X_i \leq x\}}.$$
In words: $\hat{F}_n(x)$ is the fraction of observations that are $\leq x$.
The empirical CDF is the simplest possible idea: to estimate $P(X \leq x)$, just count how many of your data points are $\leq x$ and divide by $n$. No model assumptions needed — this works for any distribution.
Suppose $n = 5$ and we observe $X_1 = 2,\; X_2 = 5,\; X_3 = 1,\; X_4 = 3,\; X_5 = 5$. Then:
- $\hat{F}_5(0) = 0/5 = 0$ (no values $\leq 0$)
- $\hat{F}_5(1) = 1/5 = 0.2$ (just $X_3 = 1$)
- $\hat{F}_5(2.5) = 2/5 = 0.4$ (values 1 and 2)
- $\hat{F}_5(5) = 5/5 = 1.0$ (all values)
$\hat{F}_5$ is a staircase function that jumps by $1/5$ at each observed data point.
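The counting definition translates directly into code. A minimal sketch (the function name is mine) that reproduces the five-point example above:

```python
def ecdf(data, x):
    """Empirical CDF: fraction of observations <= x."""
    return sum(xi <= x for xi in data) / len(data)

data = [2, 5, 1, 3, 5]
for x in [0, 1, 2.5, 5]:
    print(f"F̂_5({x}) = {ecdf(data, x)}")   # 0.0, 0.2, 0.4, 1.0
```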
3.2 Properties of the Empirical CDF
The empirical CDF has three remarkable properties:
For any fixed $x$: $\;E[\hat{F}_n(x)] = F_X(x)$.
Why? Each indicator $\mathbf{1}_{\{X_i \leq x\}}$ has expectation $P(X_i \leq x) = F_X(x)$. The average of $n$ things each with expectation $F_X(x)$ still has expectation $F_X(x)$.
By the Law of Large Numbers: $\;\hat{F}_n(x) \xrightarrow{P} F_X(x)$ for every $x$.
By the CLT: $\;\sqrt{n}\big(\hat{F}_n(x) - F_X(x)\big) \xrightarrow{d} N\big(0,\; F_X(x)(1 - F_X(x))\big)$.
Notice: For each fixed $x$, $\mathbf{1}_{\{X_i \leq x\}}$ is a Bernoulli random variable with parameter $p = F_X(x)$. So estimating $F_X(x)$ at a point is exactly a Bernoulli estimation problem!
The empirical CDF converges to the true CDF everywhere at once, not just at a single point. This is the Glivenko-Cantelli Theorem: $\sup_{x \in \mathbb{R}} \big|\hat{F}_n(x) - F_X(x)\big| \xrightarrow{a.s.} 0$.
Think of $\hat{F}_n$ as a jagged staircase and $F_X$ as a smooth curve. The Glivenko-Cantelli theorem says the staircase gets closer and closer to the smooth curve uniformly — the biggest gap between them (anywhere along the $x$-axis) shrinks to zero. This is sometimes called the "Fundamental Theorem of Statistics" because it means the empirical CDF is a universally consistent estimator of the true CDF.
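The shrinking sup-gap can be watched directly. For Uniform(0,1) data the true CDF is $F(x) = x$, so the gap is computable exactly at the jump points of the staircase; the sample sizes and seed below are illustrative choices:

```python
import random

# Glivenko-Cantelli in action: sup_x |F̂_n(x) - x| for Uniform(0,1) samples,
# which is exactly the Kolmogorov-Smirnov statistic.
random.seed(2)

def ks_uniform(n):
    """sup-gap between the empirical CDF of n Uniform(0,1) draws and F(x) = x."""
    xs = sorted(random.random() for _ in range(n))
    # The sup is attained at a jump point of the staircase F̂_n: just before
    # or just after each sorted observation.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

gaps = {n: ks_uniform(n) for n in (100, 1_000, 10_000)}
for n, g in gaps.items():
    print(f"n = {n:>6}: sup gap ≈ {g:.4f}")
```

The biggest gap anywhere along the axis shrinks roughly like $1/\sqrt{n}$ as the staircase hugs the diagonal.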
4. Plug-in Estimators
4.1 The Idea
Many quantities of interest can be written as functions of the CDF $F$ — for example, the mean, the variance, and quantiles. We call such a function a statistical functional:
A statistical functional is a map $T$ from a set of CDFs to $\mathbb{R}^k$: $T(F)$ produces a number (or vector) that depends on the distribution $F$.
The plug-in estimator is simply $\hat{\theta}_n = T(\hat{F}_n)$ — compute $T$ using the empirical CDF instead of the true CDF.
The recipe is: whatever formula you'd use if you knew the true distribution, just substitute the empirical distribution instead. Since $\hat{F}_n$ converges to $F$ (Glivenko-Cantelli), under mild conditions $T(\hat{F}_n)$ converges to $T(F)$.
4.2 Linear Functionals
If $T(F) = \int r(x)\, dF(x)$ for some fixed function $r$, then $T$ is called a linear functional. Its plug-in estimator is:
$$T(\hat{F}_n) = \int r(x)\, d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} r(X_i).$$
This is just the sample average of $r(X_1), \dots, r(X_n)$.
4.3 Examples
The $m$-th moment is $T(F) = \int x^m \, dF(x)$, so $r(x) = x^m$. Its plug-in estimator is: $T(\hat{F}_n) = \frac{1}{n}\sum_{i=1}^{n} X_i^m$. For $m=1$, this is just the sample mean $\bar{X}_n$.
The variance is $T(F) = \int x^2\, dF(x) - \left(\int x\, dF(x)\right)^2$. This is a function of two linear functionals, but it is itself non-linear. The plug-in estimator is:
$$T(\hat{F}_n) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$$
This is the (biased) sample variance. Note: the common "unbiased" version divides by $n-1$ instead of $n$, but the plug-in principle naturally gives the $1/n$ version.
The $q$-quantile is $T(F) = F^{-1}(q) = \inf\{x : F(x) \geq q\}$. The plug-in estimator is $\hat{F}_n^{-1}(q)$.
If the data has distinct values and we sort them as $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ (order statistics), then $\hat{F}_n^{-1}(q) = X_{(k)}$ where $k = \lceil nq \rceil$ (the smallest integer $\geq nq$).
For example, with $n = 20$ data points: the plug-in estimate of the median ($q = 0.5$) is $X_{(10)}$, the 10th smallest value. The 90th percentile ($q = 0.9$) is $X_{(18)}$, the 18th smallest.
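The three plug-in recipes above are all one-liners. A minimal sketch (the function names are mine) that also reproduces the $n = 20$ quantile example:

```python
import math

def plugin_moment(data, m):
    """Plug-in m-th moment: (1/n) * sum of x_i^m."""
    return sum(x ** m for x in data) / len(data)

def plugin_variance(data):
    """Plug-in variance (the 1/n version): second moment minus squared mean."""
    return plugin_moment(data, 2) - plugin_moment(data, 1) ** 2

def plugin_quantile(data, q):
    """Plug-in q-quantile: the ceil(n*q)-th order statistic."""
    xs = sorted(data)
    k = math.ceil(len(xs) * q)
    return xs[k - 1]                 # order statistics are 1-indexed

data = list(range(1, 21))            # 20 distinct values: 1, 2, ..., 20
print(plugin_quantile(data, 0.5))    # median: X_(10) = 10
print(plugin_quantile(data, 0.9))    # 90th percentile: X_(18) = 18
```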
5. The Bootstrap
5.1 The Problem
Suppose $\hat{\theta}_n = T(\hat{F}_n)$ is our plug-in estimator. We often want to compute something about its distribution, like its variance $V_F(\hat{\theta}_n)$ or its expected value $E_F(\hat{\theta}_n)$. But these depend on the unknown true distribution $F$ and might be analytically intractable.
The sample median $\hat{\theta}_n = X_{(\lceil n/2 \rceil)}$: what is its variance? For the sample mean, we have a clean formula $V(\bar{X}_n) = \sigma^2/n$. For the median, the variance depends on the entire shape of $F$ around its center. There's no simple closed-form in general.
5.2 The Bootstrap Idea
Since we can't compute $\Psi_F(\hat{\theta}_n)$ (some quantity depending on $F$), replace $F$ by $\hat{F}_n$ and compute $\Psi_{\hat{F}_n}(\hat{\theta}_n^*)$ instead. Since $\hat{F}_n$ is known (it's just the data), we can simulate from it as many times as we want.
The real world works like this: Nature has an unknown distribution $F$. We get one dataset $X_1, \dots, X_n$ from $F$ and compute one estimator $\hat{\theta}_n$.
The bootstrap creates a mirror world: we pretend $\hat{F}_n$ is the true distribution. We can generate as many fake datasets from $\hat{F}_n$ as we want, compute the estimator on each, and study how it varies. The variation we see in the mirror world approximates the variation in the real world.
5.3 What Does "Sample from $\hat{F}_n$" Mean?
$\hat{F}_n$ is a discrete distribution that puts probability $1/n$ on each observed value $X_1, \dots, X_n$. So drawing $X_1^*, \dots, X_n^*$ from $\hat{F}_n$ means:
Pick $n$ values from $\{X_1, \dots, X_n\}$ uniformly at random with replacement. Each draw is independent, and any original observation can appear 0, 1, 2, or more times in a bootstrap sample.
Original data: $\{2, 5, 1, 3, 5\}$. Some possible bootstrap samples:
- $\{5, 1, 5, 2, 5\}$ — the value 5 appeared 3 times, the value 3 not at all
- $\{3, 3, 1, 2, 5\}$ — the value 3 appeared twice
- $\{1, 1, 1, 1, 1\}$ — unlikely, but possible!
5.4 The Algorithm
To approximate $\Psi_F(\hat{\theta}_n) = E_F[h(\hat{\theta}_n)]$ for some function $h$:
- For $b = 1, \dots, B$ (where $B$ is large, e.g., 1000 or 10000):
- Draw a bootstrap sample $X_1^*, \dots, X_n^*$ by sampling with replacement from the original data.
  - Compute the bootstrap replicate $\hat{\theta}_{n,b}^* = g(X_1^*, \dots, X_n^*)$, where $g$ is the same formula that produces $\hat{\theta}_n$ from the original data.
- Approximate: $\;\Psi_{\hat{F}_n}(\hat{\theta}_n^*) \approx \frac{1}{B}\sum_{b=1}^{B} h(\hat{\theta}_{n,b}^*)$.
This works by the Law of Large Numbers: the average over the $B$ bootstrap replicates converges to the true bootstrap expectation as $B \to \infty$.
Want to estimate $V_F(\hat{\theta}_n)$? Take $h$ to be the squared deviation from the mean. In practice:
$$v_{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B}\left(\hat{\theta}_{n,b}^* - \frac{1}{B}\sum_{r=1}^{B}\hat{\theta}_{n,r}^*\right)^2.$$
This is just the sample variance of the bootstrap replicates.
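The algorithm applied to the motivating problem from Section 5.1: estimating the variance of the sample median. A sketch, with the dataset, $B$, and seed as illustrative choices:

```python
import random
import statistics

# Bootstrap estimate of the variance of the sample median -- a quantity with
# no simple closed-form expression.
random.seed(3)
data = [random.gauss(0, 1) for _ in range(100)]   # the one observed dataset
B = 2000

replicates = []
for _ in range(B):
    boot = random.choices(data, k=len(data))      # sample with replacement
    replicates.append(statistics.median(boot))    # same estimator, fake data

mean_rep = sum(replicates) / B
v_boot = sum((r - mean_rep) ** 2 for r in replicates) / B
print(f"bootstrap variance of the median ≈ {v_boot:.4f}")
```

For standard normal data the asymptotic theory gives $V(\text{median}) \approx \pi/(2n) \approx 0.0157$ at $n = 100$, and the bootstrap answer lands in that neighborhood without any formula.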
5.5 Summary Diagram

Real world: $\;F \;\longrightarrow\; X_1, \dots, X_n \;\longrightarrow\; \hat{\theta}_n = g(X_1, \dots, X_n)$

Bootstrap world: $\;\hat{F}_n \;\longrightarrow\; X_1^*, \dots, X_n^* \;\longrightarrow\; \hat{\theta}_n^* = g(X_1^*, \dots, X_n^*)$

The bootstrap approximates the (unknown) distribution of the top row by the simulable distribution of the bottom row.
6. Maximum Likelihood Estimation (MLE)
6.1 The Setup
We now return to parametric models. We have a parametric family $\mathcal{F} = \{f(x;\theta) : \theta \in \Theta\}$ and observe $X_1, \dots, X_n$ i.i.d. from $f(\cdot\,;\theta_0)$ where $\theta_0$ is the unknown true parameter.
6.2 Likelihood and Log-Likelihood
The likelihood function is the joint density of the observed data, viewed as a function of the parameter $\theta$:
$$L_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta).$$
The log-likelihood is $\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$.
$L_n(\theta)$ answers: "If the true parameter were $\theta$, how likely is it that we'd see exactly the data we observed?" It's not the probability that $\theta$ is the right parameter (that would be Bayesian thinking). It's the probability of the data given $\theta$.
We use the log-likelihood because products of many small numbers cause numerical issues, and sums are easier to differentiate than products.
The MLE is the parameter value that maximizes the likelihood (or equivalently, the log-likelihood):
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} L_n(\theta) = \arg\max_{\theta \in \Theta} \ell_n(\theta).$$
The MLE picks the parameter value under which the observed data would have been most probable. If you observed 7 heads and 3 tails, the MLE says "the coin bias that makes this data most likely is $\theta = 0.7$."
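The 7-heads-out-of-10 claim is easy to verify numerically. A sketch that evaluates the Bernoulli log-likelihood on a grid over $(0,1)$ (the grid resolution is an arbitrary choice) and reports the maximizer:

```python
import math

# Grid search for the θ that makes 7 heads in 10 flips most likely.
def log_likelihood(theta, heads, n):
    return heads * math.log(theta) + (n - heads) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]              # θ in (0, 1)
best = max(grid, key=lambda t: log_likelihood(t, 7, 10))
print(f"MLE on the grid: θ ≈ {best}")                  # 0.7
```

The maximizer lands exactly on the sample proportion $7/10$, anticipating the analytic result derived next.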
6.3 Complete Example: Bernoulli MLE
Let $X_1, \dots, X_n$ be i.i.d. $\sim \text{Bernoulli}(\theta_0)$, $\theta_0 \in (0,1)$ unknown. The density is $f(x;\theta) = \theta^x(1-\theta)^{1-x}$ for $x \in \{0,1\}$.
Step 1: Write the log-likelihood
$$\ell_n(\theta) = \sum_{i=1}^{n}\big[X_i \log\theta + (1-X_i)\log(1-\theta)\big] = n\bar{X}_n \log\theta + n(1-\bar{X}_n)\log(1-\theta).$$
Step 2: Find the stationary point
Set $\partial \ell_n/\partial\theta = 0$:
$$\frac{n\bar{X}_n}{\theta} - \frac{n(1-\bar{X}_n)}{1-\theta} = 0 \quad\Longrightarrow\quad \theta = \bar{X}_n.$$
Step 3: Verify it's a maximum
Since $\theta \mapsto \ell_n(\theta)$ is strictly concave on $(0,1)$ (the second derivative is negative), the stationary point is a global maximum. Therefore, $\hat{\theta}_n = \bar{X}_n$.
Flip a coin 50 times, observe 32 heads. Then $\hat{\theta}_{50} = \bar{X}_{50} = 32/50 = 0.64$. The MLE says the most likely coin bias given this data is 0.64.
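Because the log-likelihood is strictly concave, a generic derivative-free maximizer must agree with the analytic answer. A sketch (the helper names are mine) using ternary search to confirm $\hat{\theta}_{50} = 0.64$:

```python
import math

def log_likelihood(theta, heads, n):
    return heads * math.log(theta) + (n - heads) * math.log(1 - theta)

def ternary_max(f, lo=1e-6, hi=1 - 1e-6, iters=200):
    """Maximize a strictly concave function on (lo, hi) by ternary search."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1          # maximum lies to the right of m1
        else:
            hi = m2          # maximum lies to the left of m2
    return (lo + hi) / 2

theta_hat = ternary_max(lambda t: log_likelihood(t, 32, 50))
print(f"numerical MLE ≈ {theta_hat:.4f}")   # ≈ 0.64 = 32/50
```

For models without a closed-form MLE, a numerical maximizer like this (or gradient-based optimization) is exactly how the estimate is computed in practice.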
6.4 Asymptotic Normality of the MLE
Under regularity conditions, the MLE has a beautiful asymptotic property.
The score function is the derivative of the log-density with respect to $\theta$, evaluated at $\theta_0$:
$$s(x;\theta_0) = \frac{\partial}{\partial\theta}\log f(x;\theta)\Big|_{\theta = \theta_0}.$$
A key property: $E_{\theta_0}[s(X;\theta_0)] = 0$ (the score has mean zero).
At the true parameter $\theta_0$, the log-likelihood is "balanced" — on average, the derivative (gradient pointing toward better $\theta$'s) is zero because we're already at the right place. If it weren't zero, there would be a systematic direction to improve, contradicting that $\theta_0$ is the truth.
The Fisher Information is the variance of the score:
$$I(\theta_0) = V_{\theta_0}\big(s(X;\theta_0)\big) = E_{\theta_0}\big[s(X;\theta_0)^2\big].$$
Fisher Information measures how sensitive the log-likelihood is to changes in $\theta$ around $\theta_0$. High $I(\theta_0)$ means the log-likelihood has a sharp peak at $\theta_0$ — the data is very informative about $\theta$ and we can estimate it precisely. Low $I(\theta_0)$ means a flat, broad peak — many $\theta$ values give similar likelihoods, so estimation is imprecise.
Under regularity conditions (smooth, well-behaved model):
$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\!\left(0,\; \frac{1}{I(\theta_0)}\right).$$
The MLE is approximately normal, centered at the truth, with variance $\frac{1}{nI(\theta_0)}$.
The Cramér-Rao lower bound (which you may encounter later) states that no unbiased estimator can have variance smaller than $1/(nI(\theta_0))$. The MLE achieves this bound asymptotically — in this sense, it's the best possible estimator for large samples.
6.5 Bernoulli Example: Fisher Information
For $X \sim \text{Bernoulli}(\theta_0)$, the score is:
$$s(x;\theta_0) = \frac{x}{\theta_0} - \frac{1-x}{1-\theta_0} = \frac{x - \theta_0}{\theta_0(1-\theta_0)}.$$
Computing the variance of the score (see derivation in the notes):
$$I(\theta_0) = V_{\theta_0}\!\left(\frac{X - \theta_0}{\theta_0(1-\theta_0)}\right) = \frac{\theta_0(1-\theta_0)}{\theta_0^2(1-\theta_0)^2} = \frac{1}{\theta_0(1-\theta_0)}.$$
Therefore, the asymptotic normality result gives:
$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\big(0,\; \theta_0(1-\theta_0)\big),$$
which matches exactly what we get from the CLT. This confirms that the MLE for the Bernoulli model is asymptotically efficient.
The Fisher Information $I(\theta_0) = 1/(\theta_0(1-\theta_0))$ varies sharply with the true parameter:
| $\theta_0$ | $I(\theta_0)$ | Asymptotic variance $\theta_0(1-\theta_0)$ | Interpretation |
|---|---|---|---|
| 0.5 | 4 | 0.25 | Hardest to estimate (most uncertainty) |
| 0.1 | 11.1 | 0.09 | Easier — coin is clearly biased |
| 0.01 | 101 | 0.0099 | Very easy — almost always tails |
When $\theta_0 = 0.5$, each flip gives the least information (maximum entropy). When $\theta_0$ is near 0 or 1, outcomes are predictable, so each flip is more informative about the exact value of $\theta_0$.
7. Connecting the Pieces
Here's a bird's-eye view of the estimation toolkit from this lecture:
| Method | Assumes a parametric model? | Key idea | Best for |
|---|---|---|---|
| Plug-in via $\hat{F}_n$ | No | Replace $F$ with $\hat{F}_n$ | Moments, quantiles, any CDF-based quantity |
| Bootstrap | No (can also be parametric) | Simulate from $\hat{F}_n$ to approximate distributional properties | Variance, confidence intervals, complex functionals |
| MLE | Yes | Maximize the likelihood $\prod f(X_i;\theta)$ | Parametric models — optimal for large $n$ |
| CLT-based CI | Depends | Use asymptotic normality + Slutsky | Any setting where CLT applies |
8. Key Takeaways
- Confidence intervals quantify estimation uncertainty. A 95% CI means our procedure captures $\theta$ 95% of the time — it's about the method's reliability, not a probability statement about $\theta$ itself. The typical form is $\hat{\theta} \pm z_{1-\alpha/2} \cdot \text{SE}$.
- The empirical CDF $\hat{F}_n$ is the foundation of non-parametric estimation. It converges uniformly to the true CDF (Glivenko-Cantelli), and for any fixed point it behaves like a Bernoulli estimator.
- Plug-in estimators follow a simple recipe: whatever formula works with the true $F$, use $\hat{F}_n$ instead. For linear functionals, this gives sample averages.
- The bootstrap lets you approximate any distributional quantity (variance, bias, confidence intervals) by resampling from the data — no formulas needed, just computing power.
- Maximum Likelihood Estimation is the go-to method for parametric models. The MLE maximizes "how likely is my data under parameter $\theta$?" and is asymptotically normal with the smallest possible variance (Cramér-Rao bound).