This first lecture sets the stage for the entire course. It introduces the motivation behind computational statistics as a field, lays out the supervised learning framework that underpins most of the methods we'll study, and then dives into the first major topic: multiple linear regression and the least squares method. By the end of this week, you should understand the core setup of a statistical learning problem, be comfortable with the matrix formulation of linear regression, and have a geometric intuition for what least squares actually does.
1. Course Overview & Motivation
The course covers the statistical foundations behind widely used machine learning methods. The goal of machine learning is to program computers to learn from data — and while these algorithms are most often used for prediction, they can also help us interpret the relationships between inputs and outputs.
These methods appear everywhere: finance, healthcare, marketing, tech, climatology, ecology, and more. You'll also encounter many synonymous terms in the wild — statistical learning, data science, data analytics, and big data all largely refer to overlapping sets of ideas.
Learning Objectives
The focus is on methodology over theory. By the end of the semester, you should be able to implement the statistical learning techniques covered in class using existing software (primarily R) or your own code. The emphasis is on algorithmic thinking, model formulation, simulation-based inference, and reproducible workflows — not on learning programming syntax for its own sake.
Many ML methods lean heavily toward prediction, which has seen impressive recent progress. But inference — understanding why a model makes the predictions it does, and quantifying uncertainty — remains equally important. This tension (prediction vs. interpretation) runs throughout the entire course. For further reading, see Breiman (2001), "Statistical Modeling: The Two Cultures."
Logistics
The course uses R as the primary programming language. The evaluation is a 100% on-site digital Moodle exam consisting of TRUE/FALSE, multiple-choice, and numeric-answer questions. Everything on the slides, in the script, and on the exercise sheets is examinable unless explicitly stated otherwise.
Key references include: An Introduction to Statistical Learning (ISL), The Elements of Statistical Learning (ESL), and Computer Age Statistical Inference (CASI).
2. The Statistical Learning Framework
Before diving into any specific method, we need to establish the general setup. This framework stays with us for the rest of the course.
We observe training data $(x_1, y_1), \ldots, (x_n, y_n)$ where each $x_i \in \mathbb{R}^p$ is a vector of predictors (also called covariates, features, or inputs) and each $y_i$ is a response (or output).
Types of Variables
Quantitative variables take values in $\mathbb{R}$ with a natural ordering — values close numerically are close in meaning (e.g., height, temperature). Qualitative variables (categorical/factors) take values in a finite set with no inherent ordering (e.g., "spam" vs. "not spam").
Types of Learning Tasks
When $Y$ takes values in $\mathbb{R}$, we have a regression problem. When $Y$ takes values in a finite set $\mathcal{G} = \{G_1, \ldots, G_K\}$, we have a classification problem. When the $y_i$'s are unobserved entirely, we enter the realm of unsupervised learning (e.g., clustering), which this course does not cover.
Two Goals of Statistical Learning
- Prediction: Given a new input $x_+$, predict the unobserved response $y_+$.
- Interpretation / Inference: Understand how $x_i$ influences $y_i$. To fully understand this link, we'd need $\Pr(Y \mid X)$, which is hard. Often we settle for $\mathbb{E}(Y \mid X)$.
Random Design vs. Fixed Design
There are two paradigms for how we think about the predictors. In the random-$X$ (random design) paradigm, we assume $(x_i, y_i)$ are i.i.d. realizations from a joint distribution $\Pr(X, Y)$. In the fixed-$X$ (fixed design) paradigm, the predictors $x_1, \ldots, x_n$ are treated as fixed non-random quantities and only the responses $Y_i$ are random.
For linear regression (our first topic), we use the fixed-$X$ paradigm. Later in the course we shift to the random-$X$ view, which makes it easier to define prediction risk as $R(f) = \mathbb{E}[L(Y, f(X))]$.
3. The Linear Model
Linear regression is one of the oldest and most important tools in statistics. Despite its simplicity, it remains highly relevant — and every more advanced method you'll see later in this course builds on or contrasts with linear regression in some way.
The linear model assumes that each observation can be written as:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n.$$
In matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\mathbf{X}$ is an $n \times p$ design matrix, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the coefficient vector, and $\boldsymbol{\varepsilon}$ is a vector of error terms.
The model is called "linear" because it is linear in the coefficients $\beta_j$. The predictor variables themselves can be nonlinear transformations of raw inputs! For example, $y_i = \beta_1 + \beta_2 \log(x_{i2}) + \beta_3 \sin(\pi x_{i3}) + \varepsilon_i$ is still a linear model. This is a subtle but important point.
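To make this concrete, here is a small sketch (in Python with NumPy — the course itself uses R) that fits exactly this kind of model. The design matrix contains a log and a sine transform of the raw inputs, yet the fit is plain least squares, because the model is linear in the $\beta_j$. The specific coefficient values and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x2 = rng.uniform(1, 10, n)
x3 = rng.uniform(0, 1, n)
beta = np.array([2.0, -1.5, 0.8])  # true coefficients (illustrative)

# Build the design matrix with *transformed* columns: the model is still
# linear in beta even though it is nonlinear in the raw inputs.
X = np.column_stack([np.ones(n), np.log(x2), np.sin(np.pi * x3)])
y = X @ beta + rng.normal(scale=0.1, size=n)

# Ordinary least squares on the transformed design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

The recovered coefficients land close to the true $\boldsymbol{\beta}$: linearity in the parameters is all that least squares needs.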
Notation Conventions
The first column of $\mathbf{X}$ is typically a vector of $1$'s, making $\beta_1$ (or $\beta_0$) the intercept. We assume $n > p$ and that $\mathbf{X}$ has full column rank $p$ (i.e., the columns are linearly independent).
Estimate vs. Estimator
This distinction matters and is easy to mix up. An estimate $\hat{\beta} = \hat{\beta}(y)$ is a computed number — a deterministic function of the observed data $y$. An estimator $\hat{\beta} = \hat{\beta}(Y)$ is a random variable — it's a function of the random vector $Y$. The estimate is what you get after plugging in the data; the estimator is the abstract recipe. Different data realizations from the same model would yield different estimates.
4. The Least Squares Method
The central idea: find the coefficient vector $\boldsymbol{\beta}$ that makes the predicted values $\mathbf{X}\boldsymbol{\beta}$ as close as possible to the observed responses $\mathbf{y}$, measured by the sum of squared differences.
The least squares estimate is defined as:

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min}\ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2.$$
The quantity being minimized, $\text{RSS}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$, is called the residual sum of squares.
Deriving the Normal Equations
To find the minimum, we take the gradient of $\text{RSS}(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ and set it to zero:

$$\nabla_{\boldsymbol{\beta}}\,\text{RSS}(\boldsymbol{\beta}) = -2\,\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}.$$

This gives us the normal equations:

$$\mathbf{X}^\top\mathbf{X}\,\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}.$$

If $\mathbf{X}$ has full column rank $p$, then $\mathbf{X}^\top\mathbf{X}$ is invertible and we get the unique closed-form solution:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}.$$
The formula $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is great for theory but not how you compute it in practice: forming and inverting $\mathbf{X}^\top\mathbf{X}$ is numerically unstable. Instead, software uses the QR decomposition: factor $\mathbf{X} = \mathbf{Q}\mathbf{R}$ and solve the triangular system $\mathbf{R}\boldsymbol{\beta} = \mathbf{Q}^\top\mathbf{y}$ by back-substitution. This is much more stable.
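Both routes can be compared directly in a few lines (a NumPy sketch with arbitrary simulated data; R's `lm()` uses a QR-based routine internally):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Textbook route: solve the normal equations X^T X beta = X^T y directly
# (fine for theory, numerically fragile when X is ill-conditioned).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR route used by practical software: X = QR with R upper triangular,
# then solve the single triangular system R beta = Q^T y.
Q, R = np.linalg.qr(X)                 # "thin" QR: Q is n x p, R is p x p
beta_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(beta_normal, beta_qr))
```

On this well-conditioned toy design the two agree to machine precision; the QR route pays off when the columns of $\mathbf{X}$ are nearly collinear.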
Fitted Values and Residuals
Once we have $\hat{\boldsymbol{\beta}}$, we can compute the fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ and the residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$. The residuals are our estimates for the unknown errors $\varepsilon_i$.
The variance of the errors is estimated by:

$$\hat{\sigma}^2 = \frac{\|\mathbf{r}\|^2}{n - p} = \frac{1}{n - p} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
The denominator is $n - p$ (not $n$) to make this estimator unbiased — we "lose" $p$ degrees of freedom from estimating $p$ parameters.
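A short simulation (NumPy sketch, with an arbitrary true $\sigma = 2$) shows the $n - p$ denominator recovering the error variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0, 0.5])  # true coefficients (illustrative)
sigma = 2.0                              # true error sd, so sigma^2 = 4

y = X @ beta + rng.normal(scale=sigma, size=n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

r = y - X @ beta_hat                # residuals
sigma2_hat = (r @ r) / (n - p)      # unbiased: divide by n - p, not n

print(round(float(sigma2_hat), 2))  # close to sigma^2 = 4
```

Dividing by $n$ instead would systematically underestimate $\sigma^2$, because the residuals are already "shrunk" by fitting $p$ parameters.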
5. Linear Model Assumptions
The least squares formula works purely as an optimization procedure regardless of assumptions. But to say anything meaningful about the properties of the resulting estimator (unbiasedness, variance, optimality), we need assumptions.
1. The model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ is correctly specified (equivalently, $\mathbb{E}(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$).
2. $\mathbb{E}(\boldsymbol{\varepsilon}) = \mathbf{0}$ — the errors have mean zero.
3. $\text{Var}(\varepsilon_i) = \sigma^2$ for all $i$ — constant variance (homoscedasticity).
4. $\text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}$ — errors are uncorrelated.
5. (Optional, stronger) $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ — Gaussian errors.
Assumptions 1–4 are known as the Gauss–Markov assumptions. They are sufficient to establish that OLS is the best linear unbiased estimator (BLUE). Adding assumption 5 (Gaussianity) unlocks stronger results: exact distributions for test statistics, maximum likelihood equivalence, and UMVU optimality.
6. Geometry of Least Squares
This is perhaps the most illuminating way to understand what least squares does. Think of everything as vectors in $\mathbb{R}^n$.
The response $\mathbf{y}$ is a vector in $\mathbb{R}^n$. The columns of $\mathbf{X}$ span a $p$-dimensional subspace $\mathcal{X} \subset \mathbb{R}^n$. Varying $\boldsymbol{\beta}$ over all of $\mathbb{R}^p$, the product $\mathbf{X}\boldsymbol{\beta}$ traces out every point in this subspace.
Least squares finds the $\hat{\boldsymbol{\beta}}$ such that $\mathbf{X}\hat{\boldsymbol{\beta}}$ is the closest point in $\mathcal{X}$ to $\mathbf{y}$ in Euclidean distance. But the closest point in a subspace is always the orthogonal projection — so:
The fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ are the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residual vector $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to this column space.
The Projection (Hat) Matrix
We can write the projection explicitly as a matrix operation:

$$\hat{\mathbf{y}} = \mathbf{P}\,\mathbf{y}, \qquad \mathbf{P} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top.$$
This matrix $\mathbf{P}$ is called the projection matrix (or hat matrix, because it "puts the hat on $\mathbf{y}$"). It has three key properties — the first two characterize it as an orthogonal projection, and the third identifies the dimension of the subspace it projects onto:
- Symmetric: $\mathbf{P}^\top = \mathbf{P}$
- Idempotent: $\mathbf{P}^2 = \mathbf{P}$ — projecting twice gives the same result as projecting once
- Trace: $\text{tr}(\mathbf{P}) = p$ — the trace equals the number of parameters
The residuals can similarly be expressed as $\mathbf{r} = (\mathbf{I} - \mathbf{P})\mathbf{y}$, where $\mathbf{I} - \mathbf{P}$ is the orthogonal projection onto the complement of the column space. The matrix $\mathbf{I} - \mathbf{P}$ is also symmetric and idempotent.
If you project a vector onto a subspace and then project the result again onto the same subspace, nothing changes — it's already there. That's exactly what $\mathbf{P}^2 = \mathbf{P}$ says. Think of shining a flashlight straight down onto a table: the shadow of a shadow is the same shadow.
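All of these properties can be checked numerically on a random design (a NumPy sketch; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 4
X = rng.normal(size=(n, p))  # random full-column-rank design

# Hat matrix P = X (X^T X)^{-1} X^T, computed via a linear solve
# rather than an explicit inverse.
P = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(P, P.T))          # symmetric
print(np.allclose(P @ P, P))        # idempotent: projecting twice = once
print(np.isclose(np.trace(P), p))   # trace equals the number of parameters

# I - P projects onto the orthogonal complement of the column space:
# the residual vector is perpendicular to every column of X.
y = rng.normal(size=n)
r = (np.eye(n) - P) @ y
print(np.isclose(X.T @ r, 0).all())
```

All four checks print `True` (up to floating-point tolerance), matching the geometry described above.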
7. A Peek Ahead: What the Least Squares Estimator Achieves
Under the Gauss–Markov assumptions (covered in depth in Week 2), the least squares estimator has several desirable properties. Here's a brief preview of what's coming, since these ideas were alluded to in the Week 1 slides:
- $\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ — the OLS estimator is unbiased
- $\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$
- Gauss–Markov Theorem: among all linear unbiased estimators, $\hat{\boldsymbol{\beta}}$ has the smallest variance (it is BLUE — Best Linear Unbiased Estimator)
- Under the additional Gaussian assumption, $\hat{\boldsymbol{\beta}}$ is also the maximum likelihood estimator and is UMVU (uniformly minimum variance unbiased among all unbiased estimators, not just linear ones)
It's worth noting that the Gauss–Markov theorem, while elegant, only compares OLS to other linear unbiased estimators. Later in the course, you'll see that by allowing a small amount of bias (e.g., via Ridge or LASSO regression), you can sometimes achieve a lower mean squared error overall. This bias-variance tradeoff is a central theme.
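The first two previewed properties lend themselves to a Monte Carlo check (a NumPy sketch under the fixed-$X$ view, with illustrative choices of $\boldsymbol{\beta}$ and $\sigma$): averaging $\hat{\boldsymbol{\beta}}$ over many error realizations should recover $\boldsymbol{\beta}$, and the empirical covariance should match $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 2
X = np.column_stack([np.ones(n), np.linspace(-1, 1, n)])  # fixed design
beta = np.array([0.5, 2.0])  # true coefficients (illustrative)
sigma = 1.0

XtX_inv = np.linalg.inv(X.T @ X)
M = XtX_inv @ X.T  # maps y to beta_hat

# Repeatedly redraw the errors, keeping X and beta fixed.
B = 20000
ests = np.empty((B, p))
for b in range(B):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    ests[b] = M @ y

print(np.round(ests.mean(axis=0), 2))   # ~ beta  (unbiasedness)
print(np.allclose(np.cov(ests.T),       # ~ sigma^2 (X^T X)^{-1}
                  sigma**2 * XtX_inv, atol=0.01))
```

The simulation agrees with the formulas to within Monte Carlo error; Week 2 derives both results analytically.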
Key Takeaways
- The supervised learning framework is built around training data $\{(x_i, y_i)\}_{i=1}^n$ with two main goals: prediction and interpretation. Regression handles continuous outputs; classification handles categorical ones.
- "Linear" means linear in the parameters $\beta_j$, not necessarily in the raw inputs. Transformations of the predictors are perfectly fine.
- Least squares minimizes $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$, yielding $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ when $\mathbf{X}$ has full rank. This is equivalent to orthogonally projecting $\mathbf{y}$ onto the column space of $\mathbf{X}$.
- The hat matrix $\mathbf{P} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is symmetric, idempotent, and has trace $p$. The residuals live in the orthogonal complement: $\mathbf{r} = (\mathbf{I} - \mathbf{P})\mathbf{y}$.
- Keep the estimate vs. estimator distinction clear: estimates are computed numbers, estimators are random variables. This distinction becomes critical when we discuss properties like unbiasedness and variance.