Introduction to Machine Learning

Course: Introduction to Machine Learning (ETH Zürich), Lecture 1. Lecturers: Fanny Yang and Andreas Krause.

This opening lecture sets the stage for the entire course by answering the fundamental question: What is machine learning? It introduces the two main branches of ML — supervised and unsupervised learning — and surveys the kinds of problems each can tackle. Through concrete examples ranging from house price prediction to protein structure prediction and handwritten digit clustering, the lecture builds intuition for how data-driven methods learn patterns and make predictions. The key takeaway is a conceptual framework that organises all of ML into a small number of problem types, each with its own pipeline.

1. What Is Machine Learning?

At the highest level, machine learning is about building systems that learn from data rather than being explicitly programmed with rules. Instead of a human writing down every decision rule, an ML method is given examples (training data) and automatically finds patterns or rules that generalise to new, unseen inputs.

The course aims to give you two things by the end: (1) familiarity with common ML methods — meaning you can explain, apply, and evaluate them — and (2) reliable intuition for why and when different algorithms work well, including the ability to justify that intuition mathematically.

2. The Two Main Branches of ML

Machine learning problems are broadly divided into two categories, depending on whether the training data comes with labels (i.e. "correct answers").

Definition — Supervised Learning

In supervised learning, the training data consists of input–output pairs $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$. The goal is to learn a function $\hat{f}$ that maps inputs $\mathbf{x}$ to outputs $y$, so that the learned function generalises well to new, unseen inputs.

Definition — Unsupervised Learning

In unsupervised learning, the training data has no labels: it is simply a collection $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$. The goal is to discover hidden structure in the data — for example, groupings, compact representations, or generative rules — without being told what the "right answer" is.

3. Supervised Learning in Detail

3.1 The Supervised Learning Pipeline

The general supervised learning workflow has three steps:

  1. Step I — Collect: Gather training data as input–output pairs $(\mathbf{x}_i, y_i)$. For instance, houses with their attributes and known sale prices.
  2. Step II — Learn: Use an ML method to efficiently find a model $\hat{f} \in \mathcal{F}$ from some function class $\mathcal{F}$ that fits the training data well.
  3. Step III — Predict: Given a new test input $\mathbf{x}$, use the learned model to output a prediction $\hat{f}(\mathbf{x})$.
Figure 1: Simplified supervised learning pipeline: training data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$ feeds into an ML method that finds $\hat{f} \in \mathcal{F}$ fitting the training data; the resulting model maps a new input $\mathbf{x}$ to a prediction $\hat{f}(\mathbf{x})$.
Intuition

Think of supervised learning like studying for an exam with an answer key: you see both the questions and the correct answers during training, and your goal is to learn the underlying logic so you can answer new questions correctly.
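The Collect/Learn/Predict steps can be sketched in a few lines of Python. This is an illustrative toy: the data is made up, and plain least squares over the class of affine functions is just one possible choice of method and function class $\mathcal{F}$, not one the lecture prescribes.

```python
import numpy as np

# Step I -- Collect: toy input-output pairs (x_i, y_i).
# (Hypothetical data; the underlying rule here happens to be y = 2x + 1.)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Step II -- Learn: pick f_hat from the class F of affine functions
# f(x) = w*x + b by least-squares fitting.
X = np.stack([x, np.ones_like(x)], axis=1)   # design matrix [x, 1]
w, b = np.linalg.lstsq(X, y, rcond=None)[0]

# Step III -- Predict: evaluate f_hat on a new, unseen input.
x_new = 5.0
y_hat = w * x_new + b
print(round(y_hat, 3))  # prints 11.0
```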

3.2 Types of Supervised Learning Tasks

Supervised learning problems are further categorised by what kind of output $y$ you're trying to predict:

| Task | Output $y$ | Examples |
| --- | --- | --- |
| Regression | Continuous value ($y \in \mathbb{R}$) | House price prediction, crop yield forecasting, vital-sign trajectory forecasting |
| Classification | Discrete class label ($y \in \{1, \dots, K\}$) | Spam detection, image recognition (CIFAR-10), medical risk assessment |
| Structured prediction | Complex output beyond a single scalar or label | Machine translation (text → text), protein folding (amino acid sequence → 3D structure) |
Example — Regression: House Price Prediction

Suppose you want to sell your house. You collect data on previously sold houses: each house $i$ has attributes $\mathbf{x}_i$ (size in m², number of rooms, distance to public transport, age) and a known sale price $y_i$. You then train a regression model $\hat{f}$ on this data. Given your house's attributes $\mathbf{x}$, the model predicts the average market price $\hat{f}(\mathbf{x})$ in CHF.
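A minimal sketch of this example with a linear least-squares model. The house data below is invented for illustration, and linear regression is one possible choice of function class; the lecture does not fix a specific method.

```python
import numpy as np

# Hypothetical training set: each row is one previously sold house
# [size_m2, rooms, distance_to_transport_km, age_years], with its sale price in CHF.
X = np.array([
    [ 80.0, 3, 0.5, 20],
    [120.0, 4, 1.0, 10],
    [ 60.0, 2, 0.2, 35],
    [150.0, 5, 2.0,  5],
    [100.0, 3, 0.8, 15],
], dtype=float)
y = np.array([650_000, 900_000, 500_000, 1_100_000, 780_000], dtype=float)

# Fit a linear model f_hat(x) = w . x + b by least squares.
A = np.hstack([X, np.ones((X.shape[0], 1))])   # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the price of your own house from its attributes.
my_house = np.array([110.0, 4, 0.6, 12, 1.0])  # trailing 1 multiplies the intercept
predicted_price = my_house @ coef
```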

Example — Classification: Spam Detection

Each training example is an email $\mathbf{x}_i$ paired with a label $y_i \in \{\text{spam}, \text{not spam}\}$. The ML method learns a model that, given a new incoming email $\mathbf{x}$, predicts whether it is spam or not. This is an instance of binary classification since there are only two classes.
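One standard way to implement such a binary classifier is a multinomial naive Bayes model with Laplace smoothing; this is a common choice for text classification, not one mandated by the lecture, and the tiny training set below is invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training emails with labels.
train = [
    ("win money now claim prize", "spam"),
    ("free prize click now", "spam"),
    ("meeting agenda attached", "not spam"),
    ("lunch tomorrow at noon", "not spam"),
]

# Count words per class and emails per class.
word_counts = {"spam": Counter(), "not spam": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def predict(text):
    scores = {}
    for label in class_counts:
        # log prior + sum of Laplace-smoothed log word likelihoods
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("claim your free prize now"))  # prints spam
```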

Example — Structured Prediction: Machine Translation

The input $\mathbf{x}$ is a sentence in one language and the output $y$ is the translated sentence in another language. The output is not a single number or class but a structured sequence of words, which makes this a structured prediction problem.

4. Unsupervised Learning in Detail

In unsupervised learning, we only have inputs $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ with no associated labels. The goal is to discover interesting structure in the data itself. The lecture highlights several common goals and corresponding methods:

| Goal | What It Means | ML Method |
| --- | --- | --- |
| Anomaly detection | Identify "unusual" data points that don't fit the general pattern | Clustering, density estimation |
| Discovering latent variables | Find hidden factors (features, classes) not directly observed | Clustering, generative modelling |
| Compact representation | Compress data into a lower-dimensional form while preserving essential information | Dimensionality reduction (e.g. PCA) |
| Data generation | Create new data points that look realistic (e.g. new images or songs) | Generative modelling (e.g. VAEs, GANs) |

4.1 Clustering

Clustering groups data points by similarity. Think of it as the unsupervised analog of classification: instead of being told which category each point belongs to, the algorithm discovers the categories on its own. Methods covered later in the course include k-means and Gaussian mixture models.
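A minimal k-means sketch (Lloyd's algorithm, which the course treats later) on synthetic, unlabelled 2-D data; the two blobs, the random initialisation from data points, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical blobs of 2-D points; no labels are given to the algorithm.
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])

def kmeans(X, k, iters=20):
    """Minimal k-means: alternate assignment and mean-update steps."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

labels, centers = kmeans(X, k=2)  # the algorithm discovers the two groups itself
```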

4.2 Dimensionality Reduction

High-dimensional data (e.g. a 28×28 pixel handwritten digit image lives in a 784-dimensional space) often has much lower intrinsic complexity. Dimensionality reduction techniques like PCA find a low-dimensional representation $\mathbf{z}_i \in \mathbb{R}^k$ (with $k \ll d$) that preserves the important structure. For example, projecting MNIST digits down to 2D using PCA + t-SNE reveals clusters corresponding to different digits — without ever looking at the labels.
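A small PCA sketch via the singular value decomposition. The synthetic data is built to be genuinely near-1-dimensional, and d = 10 is used instead of 784 to keep the example tiny; both are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical high-dimensional data that really lives near a 1-D line:
# z is the hidden coordinate, embedded into d = 10 dimensions plus small noise.
z = rng.normal(size=100)
direction = rng.normal(size=10)
direction /= np.linalg.norm(direction)
X = np.outer(z, direction) + 0.01 * rng.normal(size=(100, 10))

# PCA: center the data, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 1
Z = Xc @ Vt[:k].T   # low-dimensional representation z_i in R^k

# Fraction of total variance captured by the first principal component;
# for this near-1-D data it should be close to 1.
explained = S[0]**2 / (S**2).sum()
```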

4.3 Generative Modelling

Generative models learn to produce new data that looks like the training data. Unlike supervised learning, there is no single "best/correct" answer — the model aims to capture the overall distribution of the data. Examples include generating new celebrity face images or new songs in a certain style. A key idea is learning a compact latent representation $\mathbf{z} = (z_1, \dots, z_k)$ where each dimension captures a meaningful factor of variation (e.g. "baldness" or "gender" in faces).
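The simplest possible generative model, fitting a single Gaussian to the data and sampling from it, already illustrates the idea of capturing the data distribution and generating new points. Real generative models such as VAEs and GANs are far more expressive; the data below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "training data": 2-D points from some unknown distribution.
X = rng.normal(loc=[1.0, -2.0], scale=[0.5, 1.5], size=(500, 2))

# Fit a Gaussian model of the data distribution: mean and covariance.
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Generate new points x' that "look like" the training data.
X_new = rng.multivariate_normal(mu, cov, size=5)
```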

Supervised vs. Unsupervised — The Connection

There is a natural parallel between the two branches. Classification (supervised, discrete labels) has its unsupervised counterpart in clustering. Regression (supervised, continuous output) has its counterpart in dimensionality reduction. Additionally, density estimation — estimating the underlying probability distribution of the data — is a core unsupervised task linked to generative modelling.

5. The Unsupervised Learning Pipeline

The unsupervised pipeline differs from the supervised one in that there are no labels during training, and the output of the model is different in nature. Instead of predicting a single target value, unsupervised models can produce outputs of various kinds, such as descriptions of the data's structure or newly generated data points.

Figure 2: Simplified unsupervised learning pipeline: training data $\mathbf{x}_1, \dots, \mathbf{x}_n$ (no labels) feeds into an ML method that finds rules fitting the data; the learned rules describe characteristics of $\mathbf{x}$ or generate new data $\mathbf{x}'$.

6. Course Roadmap Preview

The lecture concludes with a preview of the topics the course will cover. The syllabus is organised into three major blocks.

Key Takeaways