This opening lecture sets the stage for the entire course by answering the fundamental question: What is machine learning? It introduces the two main branches of ML — supervised and unsupervised learning — and surveys the kinds of problems each can tackle. Through concrete examples ranging from house price prediction to protein structure prediction and handwritten digit clustering, the lecture builds intuition for how data-driven methods learn patterns and make predictions. The key takeaway is a conceptual framework that organises all of ML into a small number of problem types, each with its own pipeline.
1. What Is Machine Learning?
At the highest level, machine learning is about building systems that learn from data rather than being explicitly programmed with rules. Instead of a human writing down every decision rule, an ML method is given examples (training data) and automatically finds patterns or rules that generalise to new, unseen inputs.
The course aims to give you two things by the end: (1) familiarity with common ML methods — meaning you can explain, apply, and evaluate them — and (2) reliable intuition for why and when different algorithms work well, including the ability to justify that intuition mathematically.
2. The Two Main Branches of ML
Machine learning problems are broadly divided into two categories depending on whether or not the training data comes with labels (i.e. "correct answers").
In supervised learning, the training data consists of input–output pairs $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$. The goal is to learn a function $\hat{f}$ that maps inputs $\mathbf{x}$ to outputs $y$, so that the learned function generalises well to new, unseen inputs.
In unsupervised learning, the training data has no labels: it is simply a collection $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$. The goal is to discover hidden structure in the data — for example, groupings, compact representations, or generative rules — without being told what the "right answer" is.
3. Supervised Learning in Detail
3.1 The Supervised Learning Pipeline
The general supervised learning workflow has three steps:
- Step I — Collect: Gather training data as input–output pairs $(\mathbf{x}_i, y_i)$. For instance, houses with their attributes and known sale prices.
- Step II — Learn: Use an ML method to efficiently find a model $\hat{f} \in \mathcal{F}$ from some function class $\mathcal{F}$ that fits the training data well.
- Step III — Predict: Given a new test input $\mathbf{x}$, use the learned model to output a prediction $\hat{f}(\mathbf{x})$.
Think of supervised learning like studying for an exam with an answer key: you see both the questions and the correct answers during training, and your goal is to learn the underlying logic so you can answer new questions correctly.
3.2 Types of Supervised Learning Tasks
Supervised learning problems are further categorised by what kind of output $y$ you're trying to predict:
| Task | Output $y$ | Examples |
|---|---|---|
| Regression | Continuous value ($y \in \mathbb{R}$) | House price prediction, crop yield forecasting, vital-sign trajectory forecasting |
| Classification | Discrete class label ($y \in \{1, \dots, K\}$) | Spam detection, image recognition (CIFAR-10), medical risk assessment |
| Structured Prediction | Complex output beyond a scalar | Machine translation (text → text), protein folding (amino acid sequence → 3D structure) |
Example (regression): Suppose you want to sell your house. You collect data on previously sold houses: each house $i$ has attributes $\mathbf{x}_i$ (size in m², number of rooms, distance to public transport, age) and a known sale price $y_i$. You then train a regression model $\hat{f}$ on this data. Given your house's attributes $\mathbf{x}$, the model predicts the average market price $\hat{f}(\mathbf{x})$ in CHF.
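The three pipeline steps can be sketched end-to-end for this example. The sketch below fits a linear model by least squares; all attribute values and prices are invented for illustration, and a real model would use far more data and features.

```python
import numpy as np

# Step I (Collect) -- hypothetical training data: each row is
# (size in m^2, rooms, distance to transport in km, age in years);
# prices are in CHF. All numbers are made up.
X_train = np.array([
    [120.0, 4, 0.5, 30],
    [80.0,  3, 1.2, 10],
    [150.0, 5, 0.3, 45],
    [60.0,  2, 2.0,  5],
    [100.0, 4, 0.8, 20],
])
y_train = np.array([950_000.0, 720_000.0, 1_100_000.0, 560_000.0, 830_000.0])

# Step II (Learn) -- fit a linear model f(x) = w.x + b by least squares.
A = np.hstack([X_train, np.ones((len(X_train), 1))])  # append a bias column
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Step III (Predict) -- price a new house from its attributes (plus bias term).
x_new = np.array([110.0, 4, 0.6, 25, 1.0])
prediction = coef @ x_new
```

Here the function class $\mathcal{F}$ is the set of linear functions of the attributes, and "fitting well" means minimising the squared error on the training set.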
Example (binary classification): Each training example is an email $\mathbf{x}_i$ paired with a label $y_i \in \{\text{spam}, \text{not spam}\}$. The ML method learns a model that, given a new incoming email $\mathbf{x}$, predicts whether it is spam or not. This is an instance of binary classification since there are only two classes.
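As a minimal illustration of learning a classifier from labelled examples, the sketch below represents each email by three invented numeric features and uses a nearest-centroid rule (one of the simplest possible classifiers, not the method a real spam filter would use):

```python
import numpy as np

# Toy features per email: (count of "free", count of "winner", length / 100).
# Labels: 1 = spam, 0 = not spam. All numbers are illustrative.
X = np.array([[3, 2, 0.5], [4, 1, 0.4], [0, 0, 1.2], [1, 0, 2.0]])
y = np.array([1, 1, 0, 0])

# Learn: compute one centroid (mean feature vector) per class.
centroids = {c: X[y == c].mean(axis=0) for c in (0, 1)}

def predict(x):
    # Predict the class whose centroid is closest to x.
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(predict(np.array([5, 3, 0.3])))  # → 1 (a "free"/"winner"-heavy email)
```

The same structure — learn per-class statistics, then assign the nearest class — recurs in more powerful classifiers later in the course.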
Example (structured prediction): The input $\mathbf{x}$ is a sentence in one language and the output $y$ is the translated sentence in another language. The output is not a single number or class but a structured sequence of words, which makes this a structured prediction problem.
4. Unsupervised Learning in Detail
In unsupervised learning, we only have inputs $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ with no associated labels. The goal is to discover interesting structure in the data itself. The lecture highlights several common goals and corresponding methods:
| Goal | What It Means | ML Method |
|---|---|---|
| Anomaly detection | Identify "unusual" data points that don't fit the general pattern | Clustering, density estimation |
| Discovering latent variables | Find hidden factors (features, classes) not directly observed | Clustering, generative modelling |
| Compact representation | Compress data into a lower-dimensional form while preserving essential information | Dimensionality reduction (e.g. PCA) |
| Data generation | Create new data points that look realistic (e.g. new images or songs) | Generative modelling (e.g. VAEs, GANs) |
4.1 Clustering
Clustering groups data points by similarity. Think of it as the unsupervised analog of classification: instead of being told which category each point belongs to, the algorithm discovers the categories on its own. Methods covered later in the course include k-means and Gaussian mixture models.
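The alternating structure of k-means — assign each point to its nearest centre, then recompute each centre as the mean of its points — can be sketched directly in NumPy. The two blobs below are synthetic data invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unlabelled 2-D blobs (synthetic data; no labels are used anywhere).
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(4.0, 0.5, (50, 2))])

def kmeans(X, k, n_iter=20):
    # Initialise centres with k random data points, then alternate between
    # the assignment step and the centre-update step.
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None] - centres, axis=2), axis=1)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels, centres

labels, centres = kmeans(X, k=2)
```

On well-separated data like this, the recovered centres land near the true blob means — even though the algorithm never saw which blob any point came from.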
4.2 Dimensionality Reduction
High-dimensional data (e.g. a 28×28 pixel handwritten digit image lives in a 784-dimensional space) often has much lower intrinsic complexity. Dimensionality reduction techniques like PCA find a low-dimensional representation $\mathbf{z}_i \in \mathbb{R}^k$ (with $k \ll d$) that preserves the important structure. For example, projecting MNIST digits down to 2D using PCA + t-SNE reveals clusters corresponding to different digits — without ever looking at the labels.
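The idea that high-dimensional data can have low intrinsic dimension can be made concrete with a small PCA sketch. Below, synthetic 50-dimensional points are constructed to secretly live near a 2-D plane; PCA (computed via the SVD of the centred data) recovers that structure:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 200 points that really live on a 2-D plane embedded in
# 50 dimensions, plus a little noise. (Constructed purely for illustration.)
Z = rng.normal(size=(200, 2))              # hidden 2-D coordinates
W = rng.normal(size=(2, 50))               # linear embedding into 50-D
X = Z @ W + 0.01 * rng.normal(size=(200, 50))

# PCA: centre the data, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z_hat = Xc @ Vt[:k].T                      # low-dimensional representation

# The top 2 singular values carry almost all the variance, confirming
# that the intrinsic dimension is ~2.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"variance explained by 2 components: {explained:.3f}")
```

For the MNIST example in the text, the same projection step would map each 784-dimensional digit image to a handful of coordinates before t-SNE is applied.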
4.3 Generative Modelling
Generative models learn to produce new data that looks like the training data. Unlike supervised learning, there is no single "best/correct" answer — the model aims to capture the overall distribution of the data. Examples include generating new celebrity face images or new songs in a certain style. A key idea is learning a compact latent representation $\mathbf{z} = (z_1, \dots, z_k)$ where each dimension captures a meaningful factor of variation (e.g. "baldness" or "gender" in faces).
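The simplest possible generative model makes the "capture the distribution, then sample from it" idea concrete: fit a multivariate Gaussian to the data and draw new points from it. (Real image or music generators use far richer models such as the VAEs and GANs named above; the data here is synthetic.)

```python
import numpy as np

rng = np.random.default_rng(2)

# "Training data": 2-D points from some unknown distribution (synthetic here).
X = rng.normal(loc=[1.0, -2.0], scale=[0.5, 1.5], size=(1000, 2))

# Learn a generative model: estimate the mean and covariance of the data.
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Generate: sample brand-new points from the fitted distribution.
X_new = rng.multivariate_normal(mu, cov, size=5)
```

Note there is no "correct" value for any row of `X_new` — the samples are judged only by how well they resemble the training distribution as a whole.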
There is a natural parallel between the two branches. Classification (supervised, discrete labels) has its unsupervised counterpart in clustering. Regression (supervised, continuous output) has its counterpart in dimensionality reduction. Additionally, density estimation — estimating the underlying probability distribution of the data — is a core unsupervised task linked to generative modelling.
5. The Unsupervised Learning Pipeline
The unsupervised pipeline differs from the supervised one in that there are no labels during training, and the output of the model is different in nature. Instead of predicting a single target value, unsupervised models can produce various outputs:
- Characteristics of a new input relative to training data: e.g. which cluster does it belong to? Is it an anomaly? What is its low-dimensional representation?
- Generated new data: given a prompt or random seed, produce a new data point that resembles the training distribution.
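The first kind of output — characterising a new input relative to the training data — can be illustrated with a minimal anomaly detector. It takes the density-estimation view: fit a Gaussian to the (unlabelled, synthetic) training data and flag test points that are improbably far from the bulk, which in one dimension reduces to a z-score test:

```python
import numpy as np

rng = np.random.default_rng(3)

# Unlabelled training data: 1-D sensor readings (synthetic).
X_train = rng.normal(10.0, 2.0, size=500)

# Fit a 1-D Gaussian density to the training data.
mu, sigma = X_train.mean(), X_train.std()

def is_anomaly(x, threshold=3.0):
    # Flag points more than `threshold` standard deviations from the mean,
    # i.e. points the fitted density assigns very low probability.
    return abs(x - mu) / sigma > threshold

print(is_anomaly(10.5), is_anomaly(25.0))  # → False True
```

The threshold of 3 standard deviations is an arbitrary illustrative choice; in practice it would be tuned to the application's tolerance for false alarms.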
6. Course Roadmap Preview
The lecture concludes with a preview of the topics the course will cover. The syllabus is organised into three major blocks:
- Supervised learning: Linear regression and classification, optimisation, model validation, bias-variance trade-off, regularisation, kernels, neural networks and deep learning basics.
- Unsupervised learning: Dimensionality reduction, representation learning, clustering, and generative modelling with neural networks.
- The statistical perspective: Probabilistic modelling (discriminative vs. generative), Bayesian decision theory, and formalising intuitions with mathematical statements and derivations.
Key Takeaways
- ML learns from data, not explicit rules. Instead of hand-coding decision logic, machine learning methods automatically extract patterns from training examples.
- Supervised vs. unsupervised is the first fork. The presence or absence of labels $y_i$ in the training data determines which branch of ML you're working in, and this choice shapes the entire pipeline.
- Supervised learning splits into regression, classification, and structured prediction. Regression predicts continuous values, classification predicts discrete labels, and structured prediction handles complex outputs like text or 3D structures.
- Unsupervised learning discovers hidden structure. Without labels, methods like clustering, dimensionality reduction, and generative modelling find groupings, compact representations, and data distributions.
- The supervised pipeline is Collect → Learn → Predict. You gather labelled training data, fit a model $\hat{f} \in \mathcal{F}$, and use it to make predictions on new inputs — this three-step framework recurs throughout the course.
- The course balances intuition with mathematical rigour. You should not only be able to apply ML methods, but also explain why and when they work, backed by mathematical justification.