This lecture builds on the linear model foundations from Week 1. The main goals are to understand the statistical properties of the least squares estimator (unbiasedness, variance, optimality), prove the Gauss–Markov theorem, extend optimality to UMVU under Gaussian errors, and take the first steps towards inference (hypothesis testing and confidence intervals) in the linear model. The lecture also features a group game to build intuition about sampling variability in regression.
1. Recap from Week 1
Last week covered the statistical learning problem, the distinction between prediction and inference, and the difference between an estimate (a fixed number computed from observed data) and an estimator (a random variable, since it depends on the random response $Y$). We set up the linear model $Y = X\beta + \varepsilon$ and derived the least squares estimate $\hat{\beta} = (X^\top X)^{-1}X^\top y$, explored the geometry of the projection matrix $P = X(X^\top X)^{-1}X^\top$ (which is idempotent), and saw that the fitted values are $\hat{y} = Py$ while the residuals are $r = (I - P)y$.
2. Group Game & Takeaways
The lecture began with a hands-on group game (~20 minutes) in which teams of about 10 students collected data (e.g. height, shoe size, hand length) and fitted linear models to their own group's data. The exercise was designed to build intuition about the following key concepts:
- The estimator $\hat{\beta}$ is random: Different groups of 10 students obtained different slope estimates for the same model, illustrating sampling variability.
- Model selection: Is hand length a better predictor than shoe size? Can we drop one? (Covered later in the course.)
- Hypothesis testing: How can we say if a slope is "significantly" different from zero?
- Confidence intervals: How would variability change if each group had 100 observations instead of 10?
- Multiple vs. single regression: The coefficient estimates of a predictor can change depending on what other predictors are in the model (see §1.3.4 in the script).
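The sampling-variability and sample-size takeaways can be reproduced in a small simulation (a sketch with made-up numbers for a height-vs-shoe-size model, not the class data; only `numpy` is used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population model: shoe size (cm) linearly predicts height (cm).
# The intercept, slope, and noise level below are illustrative assumptions.
beta0, beta1, sigma = 100.0, 2.5, 5.0

def group_slope(n):
    """Simulate one 'group' of n students and return the fitted OLS slope."""
    shoe = rng.uniform(22, 32, size=n)
    height = beta0 + beta1 * shoe + rng.normal(0, sigma, size=n)
    X = np.column_stack([np.ones(n), shoe])
    return np.linalg.lstsq(X, height, rcond=None)[0][1]

slopes_10 = np.array([group_slope(10) for _ in range(500)])    # groups of 10
slopes_100 = np.array([group_slope(100) for _ in range(500)])  # groups of 100

# Every group estimates the same true slope, but the estimates scatter;
# larger groups scatter noticeably less.
print(slopes_10.std(), slopes_100.std())
```

Each simulated "group" plays the role of one team in the game: same model, different data, different slope estimate.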
3. Properties of the Least Squares Estimator
Now we move from the estimate (lowercase $y$, fixed data) to the estimator (uppercase $Y$, random). The least squares estimator is $\hat{\beta} = (X^\top X)^{-1}X^\top Y$. Since $Y$ is random, $\hat{\beta}$ is also random — and we want to characterise its distribution.
3.1 Unbiasedness
Under the linear model $Y = X\beta + \varepsilon$ with $E[\varepsilon] = 0$, the proof of unbiasedness is straightforward:
$$\hat{\beta} = (X^\top X)^{-1}X^\top Y = (X^\top X)^{-1}X^\top (X\beta + \varepsilon) = \beta + (X^\top X)^{-1}X^\top \varepsilon,$$
so $E[\hat{\beta}] = \beta + (X^\top X)^{-1}X^\top E[\varepsilon] = \beta$.
In words: on average, across many hypothetical datasets drawn from the same model, the OLS estimator hits the true parameter value. There is no systematic over- or under-estimation.
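This "average over hypothetical datasets" picture can be simulated directly (the design matrix, true $\beta$, and error scale below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed design (n = 50 observations, p = 3 coefficients incl. intercept);
# the true beta is an arbitrary illustrative choice.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, -2.0, 0.5])
ols = np.linalg.solve(X.T @ X, X.T)  # maps y to beta_hat

# Average the OLS estimator over many hypothetical datasets from the model.
betas = np.array([ols @ (X @ beta + rng.normal(0, 1.0, size=n))
                  for _ in range(20000)])
print(betas.mean(axis=0))  # close to the true (1.0, -2.0, 0.5): no bias
```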
3.2 Covariance of $\hat{\beta}$
We also need to know how much the estimator varies from sample to sample. The key derivation (shown step-by-step on the slides) proceeds by substituting $Y = X\beta + \varepsilon$ and using $\text{Cov}(\varepsilon) = \sigma^2 I$:
$$\text{Cov}(\hat{\beta}) = (X^\top X)^{-1}X^\top\,\text{Cov}(\varepsilon)\,X(X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}X^\top X(X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}.$$
Under the linear model $Y = X\beta + \varepsilon$, $E[\varepsilon]=0$, $\text{Cov}(\varepsilon) = \sigma^2 I$:
- $E[\hat{\beta}] = \beta$ (unbiased)
- $\text{Cov}(\hat{\beta}) = \sigma^2(X^\top X)^{-1}$
- $\text{Cov}(\hat{Y}) = \sigma^2 P$, $\text{Cov}(\tilde{r}) = \sigma^2(I - P)$
- $\text{Var}(\tilde{r}_i) = \sigma^2(1 - P_{ii})$ — note that, unlike the errors, the residuals are correlated with one another and have unequal variances!
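These covariance formulas can be verified numerically (illustrative design and $\sigma$; the empirical covariance of $\hat{\beta}$ over many simulated datasets should match $\sigma^2(X^\top X)^{-1}$):

```python
import numpy as np

rng = np.random.default_rng(2)

# Small fixed design with intercept and one predictor; sigma is illustrative.
n, sigma = 40, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 1.5])
XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T            # projection (hat) matrix, tr(P) = p = 2

# Simulate the estimator many times and compare empirical vs. theoretical cov.
E = rng.normal(0, sigma, size=(50000, n))       # one row of errors per dataset
betas = (X @ beta + E) @ (XtX_inv @ X.T).T      # each row is one beta_hat
emp_cov = np.cov(betas, rowvar=False)
print(emp_cov)
print(sigma**2 * XtX_inv)  # the two matrices agree closely
```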
3.3 Estimating $\sigma^2$
The error variance $\sigma^2$ is unknown and must be estimated. The natural estimator uses the residuals:
$$\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^{n}\tilde{r}_i^2 = \frac{\|Y - X\hat{\beta}\|^2}{n-p}.$$
Why divide by $n - p$ instead of $n$? Because fitting $p$ parameters "uses up" $p$ degrees of freedom. Formally, we can show $E[\hat{\sigma}^2] = \sigma^2$ by using the fact that $\sum \text{Var}(\tilde{r}_i) = \sigma^2(n - \text{tr}(P)) = \sigma^2(n-p)$. So dividing by $n-p$ makes the estimator unbiased.
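A simulation confirms the degrees-of-freedom correction (illustrative $n$, $p$, and design, with true $\sigma^2 = 1$): dividing the residual sum of squares by $n$ systematically underestimates $\sigma^2$, while dividing by $n - p$ does not.

```python
import numpy as np

rng = np.random.default_rng(3)

# n = 30 observations, p = 4 coefficients, true sigma^2 = 1 (all illustrative).
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
H = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # residual maker I - P

E = rng.normal(size=(40000, n))    # one row of errors per simulated dataset
R = (X @ beta + E) @ H.T           # residual vectors (H is symmetric)
rss = (R**2).sum(axis=1)
print((rss / (n - p)).mean())  # close to 1: dividing by n-p is unbiased
print((rss / n).mean())        # close to (n-p)/n ~ 0.867: biased downward
```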
4. Optimality: The Gauss–Markov Theorem
We now ask: is OLS the "best" we can do? The answer is yes — but only within a specific class of estimators and under specific assumptions.
Consider the linear model $Y = X\beta + \varepsilon$ with $\text{rank}(X) = p$, and assume:
- $E[\varepsilon] = 0$ (errors have mean zero)
- $\text{Cov}(\varepsilon) = \sigma^2 I$ (errors are uncorrelated with constant variance)
Under the Gauss–Markov assumptions, $\hat{\beta} = (X^\top X)^{-1}X^\top Y$ is the Best Linear Unbiased Estimator (BLUE) of $\beta$. That is, for any other linear unbiased estimator $\tilde{\beta}$ of $\beta$,
$$\text{Cov}(\tilde{\beta}) - \text{Cov}(\hat{\beta}) \succeq 0 \quad \text{(positive semidefinite)}.$$
4.1 Proof Sketch
The proof is elegant and worth understanding in full. Here's the idea:
Any linear estimator can be written as $\tilde{\beta} = AY$ for some matrix $A \in \mathbb{R}^{p \times n}$. For it to be unbiased for all $\beta$, we need $AX = I_p$. Now decompose $A = (X^\top X)^{-1}X^\top + B$ for some matrix $B$. The unbiasedness condition then forces $BX = 0$.
Computing the covariance:
$$\text{Cov}(\tilde{\beta}) = \sigma^2 A A^\top = \sigma^2\Big((X^\top X)^{-1} + (X^\top X)^{-1}X^\top B^\top + BX(X^\top X)^{-1} + BB^\top\Big) = \sigma^2\Big((X^\top X)^{-1} + BB^\top\Big),$$
where the cross terms vanish because $BX = 0$. Since $BB^\top$ is always positive semidefinite, we get $\text{Cov}(\tilde{\beta}) - \text{Cov}(\hat{\beta}) = \sigma^2 BB^\top \succeq 0$. Equality holds only when $B = 0$, i.e., when the estimator is OLS.
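The proof's construction can be checked numerically: pick any $B$ with $BX = 0$, form the competing estimator $A = (X^\top X)^{-1}X^\top + B$, and verify that its covariance exceeds the OLS covariance by a positive semidefinite matrix (illustrative design and $\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Fixed design with n = 25, p = 3; sigma^2 = 1 (illustrative assumptions).
n, p, sigma2 = 25, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
ols = np.linalg.solve(X.T @ X, X.T)      # (X^T X)^{-1} X^T
P = X @ ols                              # projection matrix

# Any B of the form C(I - P) satisfies B X = 0, giving another
# linear unbiased estimator A = ols + B.
B = rng.normal(size=(p, n)) @ (np.eye(n) - P)
A = ols + B
print(np.allclose(A @ X, np.eye(p)))     # True: unbiasedness condition AX = I

cov_ols = sigma2 * np.linalg.inv(X.T @ X)
cov_alt = sigma2 * A @ A.T
# The difference equals sigma^2 B B^T, hence is positive semidefinite:
print(np.linalg.eigvalsh(cov_alt - cov_ols).min())  # >= 0 up to rounding
```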
4.2 Limitations of Gauss–Markov
The Gauss–Markov theorem has clear limitations:
- It assumes the linear model is correctly specified.
- It assumes errors are uncorrelated with constant variance $\sigma^2$.
- It only considers linear and unbiased estimators.
The key insight: we can relax the unbiasedness requirement! By allowing a little bias, we can sometimes achieve much lower overall error (mean squared error). This is the idea behind Ridge and LASSO regression, which will be covered later in the course.
4.3 EduApp Example: Exponential Errors
Consider $Y = \beta_0 + \beta_1 X_1 + \varepsilon$ where $\varepsilon \sim \text{Exp}(\lambda)$. At first glance this seems problematic since $E[\varepsilon] = 1/\lambda \neq 0$. But we can rewrite:
$$Y = \Big(\beta_0 + \frac{1}{\lambda}\Big) + \beta_1 X_1 + \tilde{\varepsilon}, \qquad \tilde{\varepsilon} := \varepsilon - \frac{1}{\lambda}.$$
Now $E[\tilde{\varepsilon}] = 0$ and $\text{Var}(\tilde{\varepsilon}) = 1/\lambda^2$, and the errors are uncorrelated. So the Gauss–Markov assumptions are satisfied (with a shifted intercept), and OLS is still BLUE. The non-zero mean of the original error simply gets absorbed into the intercept.
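A simulation of this example (illustrative values for $\lambda$, $\beta_0$, $\beta_1$) shows the slope estimate is unaffected while the intercept absorbs the error mean $1/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(5)

# y = beta0 + beta1*x + eps with eps ~ Exp(rate = lam); values are illustrative.
n, lam = 200, 2.0
beta0, beta1 = 1.0, 3.0
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
ols = np.linalg.solve(X.T @ X, X.T)

# Fit OLS on many datasets with exponential (non-zero-mean) errors.
fits = np.array([ols @ (beta0 + beta1 * x + rng.exponential(1 / lam, n))
                 for _ in range(5000)])
print(fits[:, 0].mean())  # close to beta0 + 1/lam = 1.5 (shifted intercept)
print(fits[:, 1].mean())  # close to beta1 = 3.0 (slope unaffected)
```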
5. The Gaussian Linear Model & UMVU
Can we do better than just "best among linear unbiased"? Yes — if we're willing to make a stronger distributional assumption.
5.1 Gaussian Errors: OLS = MLE
Suppose we strengthen our assumptions to $\varepsilon \sim N(0, \sigma^2 I)$, i.e., the errors are jointly normal. The likelihood function is then proportional to
$$\exp\!\Big(-\frac{1}{2\sigma^2}\,\|y - X\beta\|^2\Big).$$
Maximising this likelihood with respect to $\beta$ is equivalent to minimising $\|y - X\beta\|^2$ — which is exactly the least squares criterion. So under Gaussian errors, OLS and maximum likelihood estimation give the same answer.
5.2 UMVU via Cramér–Rao
Under the Gaussian linear model, $\hat{\beta}$ is a uniformly minimum variance unbiased (UMVU) estimator of $\beta$. This means it has the smallest variance among all unbiased estimators — not just linear ones.
The proof relies on the Cramér–Rao lower bound. The Fisher information matrix is $I(\beta) = \frac{1}{\sigma^2}X^\top X$, and the Cramér–Rao bound says that no unbiased estimator can have covariance smaller than $I(\beta)^{-1} = \sigma^2(X^\top X)^{-1}$. Since $\text{Cov}(\hat{\beta})$ equals exactly this bound, $\hat{\beta}$ is UMVU.
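The Fisher information computation follows in one line from the Gaussian log-likelihood:
$$\ell(\beta) = -\frac{1}{2\sigma^2}\|y - X\beta\|^2 + \text{const}, \qquad \nabla_\beta\,\ell(\beta) = \frac{1}{\sigma^2}X^\top(y - X\beta), \qquad I(\beta) = -E\big[\nabla^2_\beta\,\ell(\beta)\big] = \frac{1}{\sigma^2}X^\top X.$$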
Think of this as a progression of increasingly strong results:
- Gauss–Markov assumptions only: $\hat{\beta}$ is BLUE (best among linear & unbiased).
- Gauss–Markov + Gaussian errors: $\hat{\beta}$ is UMVU (best among all unbiased), and also the MLE.
However, even UMVU does not mean universally optimal — biased estimators (Ridge/LASSO) can have lower MSE.
6. A Step Towards Inference
With the distributional results in hand, we can now do inference — constructing tests and confidence intervals for the regression coefficients.
6.1 Distributions under Gaussian Errors
Under $\varepsilon \sim N(0, \sigma^2 I)$, the following hold, and $\hat{\beta}$ and $\hat{\sigma}^2$ are independent:
- $\hat{\beta} \sim N_p\!\big(\beta,\; \sigma^2(X^\top X)^{-1}\big)$
- $\hat{\sigma}^2 \sim \frac{\sigma^2}{n-p}\,\chi^2_{n-p}$
These results are the foundation for all classical inference in the linear model: $t$-tests, $F$-tests, confidence intervals, and prediction intervals.
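The $\chi^2$ result can be checked by simulation (illustrative $n$ and $p$; with $\sigma^2 = 1$, the scaled quantity $(n-p)\hat{\sigma}^2$ should have mean $n-p$ and variance $2(n-p)$):

```python
import numpy as np

rng = np.random.default_rng(6)

# n = 20, p = 3, sigma^2 = 1: (n-p) * sigma_hat^2 should be chi^2_{n-p}.
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # I - P

E = rng.normal(size=(100000, n))   # Gaussian errors; beta = 0 w.l.o.g.
rss = ((E @ H) ** 2).sum(axis=1)   # (n-p) * sigma_hat^2 for each dataset
print(rss.mean())  # close to the chi^2_{17} mean, n - p = 17
print(rss.var())   # close to the chi^2_{17} variance, 2(n-p) = 34
```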
6.2 Testing Individual Coefficients: The $t$-Test
To test whether the $j$-th predictor is relevant, we test $H_{0,j}: \beta_j = 0$ against $H_{A,j}: \beta_j \neq 0$. Under $H_0$:
$$T_j = \frac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2\,(X^\top X)^{-1}_{jj}}} \sim t_{n-p}.$$
The denominator is the estimated standard error of $\hat{\beta}_j$. If $|T_j|$ is large (or equivalently the $p$-value is small), we reject $H_0$ and conclude the predictor has a significant effect — conditional on all other predictors being in the model.
An individual $t$-test for $H_{0,j}$ quantifies the effect of the $j$-th predictor after having subtracted the linear effect of all other predictors. This means it's possible for all individual $t$-tests to be non-significant even when the predictors collectively have a strong effect — especially when predictors are correlated.
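A minimal sketch of the $t$-test computation (simulated data where $x_1$ has a real effect and $x_2$ does not; `scipy` is assumed available for the $t$ distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Illustrative setup: x1 matters (coefficient 2), x2 does not (coefficient 0).
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # estimated standard errors
t_vals = beta_hat / se                       # T_j under H_0: beta_j = 0
p_vals = 2 * stats.t.sf(np.abs(t_vals), df=n - p)
print(t_vals)  # |T| is huge for x1; for x2 it is typically moderate
print(p_vals)
```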
6.3 Confidence Intervals
From the $t$-distribution result, we can construct a two-sided confidence interval for $\beta_j$:
$$\hat{\beta}_j \pm t_{n-p;\,1-\alpha/2}\,\sqrt{\hat{\sigma}^2\,(X^\top X)^{-1}_{jj}},$$
which covers the true $\beta_j$ with probability $1 - \alpha$.
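A coverage simulation (illustrative design and true coefficients) confirms the $1-\alpha$ guarantee empirically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Fixed design; check empirical coverage of the 95% CI for the slope.
n, alpha = 40, 0.05
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
q = stats.t.ppf(1 - alpha / 2, df=n - p)   # t quantile t_{n-p; 1-alpha/2}
beta = np.array([2.0, 0.7])                # illustrative true coefficients

covered = 0
for _ in range(2000):
    y = X @ beta + rng.normal(0, 1.0, n)
    b = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ b) ** 2) / (n - p)
    half = q * np.sqrt(s2 * XtX_inv[1, 1])
    covered += (b[1] - half <= beta[1] <= b[1] + half)
print(covered / 2000)  # close to 1 - alpha = 0.95
```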
6.4 Practical Considerations
The normality assumption on $\varepsilon_i$ is often not exactly satisfied in practice. However, thanks to the central limit theorem, for large sample size $n$, the distributional results above remain approximately valid. For strongly non-Gaussian errors, robust methods may be preferable (though not covered in this course).
7. Reading R Output: Body Fat Example
The lecture included an example with body fat data ($n = 247$ men, $p = 13$ quantitative predictors) to practice reading the output of R's `summary(lm(...))`. The key columns are:
| Column | What it shows | Formula |
|---|---|---|
| Estimate | $\hat{\beta}_j$ | $(X^\top X)^{-1}X^\top y$ |
| Std. Error | Estimated std. dev. of $\hat{\beta}_j$ | $\sqrt{\hat{\sigma}^2 (X^\top X)^{-1}_{jj}}$ |
| t value | Test statistic for $H_0: \beta_j = 0$ | Estimate / Std. Error |
| Pr(>\|t\|) | Two-sided $p$-value | From $t_{n-p}$ distribution |
The output also reports the residual standard error $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$ with $n-p$ degrees of freedom, the $R^2$ and adjusted $R^2$ values, and the $F$-statistic for the global null hypothesis that all regression coefficients (except the intercept) are zero.
8. Summary: Levels of Assumptions & Guarantees
| Assumptions | What OLS achieves | Class of comparison |
|---|---|---|
| $E[\varepsilon]=0$, $\text{Cov}(\varepsilon)=\sigma^2 I$ (Gauss–Markov) | BLUE | Linear & unbiased estimators |
| Gauss–Markov + $\varepsilon \sim N(0,\sigma^2 I)$ | UMVU = MLE | All unbiased estimators |
| Relaxing unbiasedness | Ridge/LASSO can beat OLS in MSE | All estimators (incl. biased) |
Key Takeaways
- The OLS estimator $\hat{\beta}$ is random — different datasets from the same model give different estimates. Its expectation is $\beta$ (unbiased) and its covariance is $\sigma^2(X^\top X)^{-1}$.
- The Gauss–Markov theorem guarantees OLS is the best linear unbiased estimator under mild assumptions ($E[\varepsilon]=0$, $\text{Cov}(\varepsilon)=\sigma^2 I$). The proof uses a clever matrix decomposition $A = B + (X^\top X)^{-1}X^\top$.
- Under Gaussian errors, OLS is also the MLE and achieves UMVU status (optimal among all unbiased estimators) via the Cramér–Rao bound.
- The theorem's scope is limited: we can often do better by allowing bias (Ridge, LASSO) — this is the bias-variance tradeoff, explored later.
- The distributional results ($\hat{\beta} \sim N_p$, $\hat{\sigma}^2 \sim \chi^2$) enable classical inference: $t$-tests for individual coefficients, confidence intervals, and $F$-tests. Even without exact normality, the CLT provides asymptotic justification for large $n$.