This lecture lays the probabilistic groundwork that everything in statistical learning builds upon. We start from the very beginning — what is randomness? — and build up to random variables, their distributions, and the key tools we need to describe and work with uncertain quantities. Think of this lecture as assembling a toolkit: sample spaces, events, probabilities, random variables, CDFs, and density/mass functions are the core instruments you'll reach for in every later topic.
1. The Sample Space
Before we can talk about probabilities, we need to define what can happen. The sample space $\Omega$ is simply the set of all possible outcomes of a random experiment. Every probabilistic statement we ever make is, at its core, a statement about subsets of $\Omega$.
The sample space $\Omega$ is the set of all possible outcomes of a random experiment. Individual elements $\omega \in \Omega$ are called sample outcomes (or realizations, or elementary events).
Sample spaces come in different "sizes":
- Finite: $\Omega = \{\omega_1, \ldots, \omega_N\}$ for some $N \in \mathbb{N}$.
- Countably infinite: $\Omega = \{\omega_1, \omega_2, \ldots \}$ — you can list the outcomes one by one, but the list never ends.
- Uncountable: $\Omega = \mathbb{R}$ (or an interval like $(a, b]$) — outcomes form a continuum.
Toss a coin twice. Each toss yields Heads (H) or Tails (T). The sample space is
$$\Omega = \{(H,H),\,(H,T),\,(T,H),\,(T,T)\}.$$
This is a finite sample space with $|\Omega| = 4$ outcomes.
Roll a standard six-sided die once:
$$\Omega = \{1, 2, 3, 4, 5, 6\}.$$
Again finite, with 6 outcomes.
Measure the temperature outside right now. In principle the result could be any real number, so $\Omega = \mathbb{R}$ (uncountable). In practice you might restrict to $\Omega = [-50, 60]$ degrees Celsius, but mathematically we treat it as a continuum.
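For finite sample spaces like the two-coin-toss example, we can enumerate $\Omega$ directly. A quick sketch in Python using the standard library:

```python
from itertools import product

# Enumerate the two-coin-toss sample space: all ordered pairs of H and T.
omega = list(product("HT", repeat=2))
print(omega)       # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
print(len(omega))  # 4, matching |Omega| = 4
```

Enumerating the sample space explicitly is often the easiest way to check probability calculations in small examples.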
2. Events
An event is any subset of $\Omega$. Intuitively, an event is a "question" you can ask about the outcome of the experiment — and the answer is either "yes, it happened" or "no, it didn't."
A subset $A \subseteq \Omega$ is called an event. We say the event $A$ occurred if the observed outcome $\omega$ belongs to $A$.
Since events are sets, all the usual set operations apply:
- $A^c$ — the complement ("$A$ did not happen").
- $A \cup B$ — the union ("at least one of $A$ or $B$ happened").
- $A \cap B$ — the intersection ("both $A$ and $B$ happened").
- $A \setminus B = A \cap B^c$ — the difference ("$A$ happened but $B$ did not").
Let $A$ = "the first toss is H". Then $A = \{(H,H),\,(H,T)\}$.
Let $B$ = "at least one toss is T". Then $B = \{(H,T),(T,H),(T,T)\}$ and $A \cap B = \{(H,T)\}$ — the first toss was H and at least one toss was T.
Take $\Omega = \mathbb{R}$ and define $A_i = [0, \tfrac{1}{i})$ for $i = 1, 2, 3, \ldots$. Each subsequent interval is smaller: $A_1 = [0,1)$, $A_2 = [0,\tfrac{1}{2})$, $A_3 = [0,\tfrac{1}{3})$, and so on. Then:
$$\bigcup_{i=1}^{\infty} A_i = [0,1) = A_1, \qquad \bigcap_{i=1}^{\infty} A_i = \{0\}.$$
The union is the largest interval $A_1$. The intersection shrinks to the single point that belongs to every interval, namely $0$.
3. Probability Measure
We now want to assign a number to each event that captures "how likely" it is. A probability measure $P$ does exactly this — it's a function that takes an event and returns a number between 0 and 1, following three intuitive rules.
A probability measure $P$ is a function on events satisfying:
- Non-negativity: $P(A) \geq 0$ for every event $A$.
- Normalization: $P(\Omega) = 1$ — something always happens.
- $\sigma$-Additivity: If $A_1, A_2, \ldots$ are disjoint events (no two can happen simultaneously), then $P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.
Intuition: Axiom 3 is the powerful one. It says that if events don't overlap, the probability of "any of them happening" is just the sum of their individual probabilities. This is the foundation for most probability calculations.
Properties That Follow
From these three axioms alone, we can derive several useful facts:
- Impossible event: $P(\varnothing) = 0$.
- Complement rule: $P(A^c) = 1 - P(A)$. (Often the easiest way to compute a probability — find the probability of the opposite!)
- Bound: $P(A) \leq 1$ for any event $A$.
- Monotonicity: If $A \subseteq B$, then $P(A) \leq P(B)$. (A more specific event is at most as likely as a broader one.)
- Inclusion-Exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Suppose you roll a fair die. What is the probability of getting at least one 6 in three rolls? Directly counting outcomes with at least one 6 is messy. Instead, use the complement:
$$P(\text{at least one } 6) = 1 - P(\text{no } 6) = 1 - \left(\tfrac{5}{6}\right)^3 = \tfrac{91}{216} \approx 0.42.$$
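The complement-rule calculation can be verified by brute force over all $6^3$ equally likely outcomes, a minimal sketch:

```python
from fractions import Fraction
from itertools import product

# Complement rule: P(at least one 6) = 1 - P(no 6 in three rolls).
p_at_least_one = 1 - Fraction(5, 6) ** 3

# Brute-force check over all 6^3 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=3))
favorable = sum(1 for roll in outcomes if 6 in roll)
assert p_at_least_one == Fraction(favorable, len(outcomes))

print(p_at_least_one)  # 91/216
```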
In a class of 30 students, 18 study Python, 12 study R, and 5 study both. What fraction study at least one?
$$P(\text{Python} \cup \text{R}) = \frac{18 + 12 - 5}{30} = \frac{25}{30} = \frac{5}{6}.$$
Without subtracting the overlap, you'd double-count the 5 students who study both.
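Inclusion-exclusion is just what the set union computes for you. A sketch with illustrative student IDs (any labeling with the right overlap works):

```python
# Inclusion-exclusion with explicit sets; the student IDs are illustrative.
python_students = set(range(18))   # 18 students with IDs 0..17
r_students = set(range(13, 25))    # 12 students, 5 of whom (13..17) overlap

both = python_students & r_students
at_least_one = len(python_students | r_students)  # union avoids double-counting

print(len(both))     # 5
print(at_least_one)  # 25, i.e. 25/30 = 5/6 of the class
```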
4. Independence and Conditional Probability
Independence
Intuitively, two events are independent if knowing that one occurred tells you nothing about whether the other occurred. The formal definition translates this into a clean multiplicative rule.
Events $A$ and $B$ are independent if
$$P(A \cap B) = P(A)\,P(B).$$
$A$ = "1st toss is H" and $B$ = "2nd toss is H". Each has probability $\frac{1}{2}$. Their intersection is $\{(H,H)\}$ with probability $\frac{1}{4} = \frac{1}{2} \cdot \frac{1}{2} = P(A)\,P(B)$.
So the two tosses are independent — knowing the first result doesn't help predict the second.
A common misconception is that disjoint events are independent. In fact, disjoint events (with positive probability) are strongly dependent: if $A \cap B = \varnothing$, then $P(A \cap B) = 0$ while $P(A)\,P(B) > 0$, so the independence equation fails. Knowing $A$ occurred immediately tells you $B$ did not!
For more than two events, independence must hold for every subcollection:
A collection $\{A_i, i \in I\}$ is mutually independent if for every finite subset $J \subseteq I$:
$$P\!\left(\bigcap_{i \in J} A_i\right) = \prod_{i \in J} P(A_i).$$
Conditional Probability
When events are not independent, learning that one occurred changes your belief about the other. Conditional probability quantifies this updated belief.
For events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
Intuition: Once we know $B$ happened, our "universe" shrinks from $\Omega$ to $B$. We rescale so that $B$ has total probability 1, and then ask what fraction of $B$ is also in $A$.
Roll a fair die. Let $A$ = "result is $\leq 3$" $= \{1,2,3\}$ and $B$ = "result is even" $= \{2,4,6\}$. Then $A \cap B = \{2\}$, so
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3}.$$
Without any information, $P(A) = 1/2$. But once you know the result is even, only one of the three even numbers ($2$) is $\leq 3$, so the probability drops to $1/3$.
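Under equally likely outcomes, conditional probability reduces to counting: $P(A \mid B) = |A \cap B| / |B|$. A minimal sketch:

```python
from fractions import Fraction

omega = range(1, 7)                   # fair die, equally likely outcomes
A = {w for w in omega if w <= 3}      # "result is <= 3"
B = {w for w in omega if w % 2 == 0}  # "result is even"

# With equally likely outcomes, P(A | B) = |A ∩ B| / |B|.
p_A_given_B = Fraction(len(A & B), len(B))
print(p_A_given_B)  # 1/3
```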
$A$ and $B$ are independent (with $P(B)>0$) if and only if $P(A\mid B) = P(A)$. In words: learning $B$ happened doesn't change your belief about $A$ — that's exactly what "independent" means.
Bayes' Theorem
Bayes' theorem lets you "reverse" a conditional probability. If you know $P(B \mid A_i)$ (how likely is the evidence given a cause), it tells you $P(A_i \mid B)$ (how likely is the cause given the evidence).
Let $A_1, \ldots, A_k$ be a partition of $\Omega$ (mutually exclusive, exhaustive) with $P(A_i)>0$ for all $i$. For any event $B$ with $P(B)>0$:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j)\,P(A_j)}.$$
$P(A_i)$ is the prior (belief before seeing evidence). $P(A_i \mid B)$ is the posterior (updated belief after seeing $B$).
A disease affects 1% of a population. A test is 95% accurate: $P(\text{positive} \mid \text{disease}) = 0.95$ and $P(\text{positive} \mid \text{no disease}) = 0.05$ (5% false positive rate). You test positive — what's the probability you're actually sick?
$$P(\text{disease} \mid \text{positive}) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.059} \approx 0.161.$$
Despite the test being "95% accurate," there's only about a 16% chance you actually have the disease! The low base rate (1%) means most positive results are false alarms. This is one of the most important and counter-intuitive consequences of Bayes' theorem.
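The base-rate calculation is two lines of arithmetic; a sketch that makes the law-of-total-probability denominator explicit:

```python
# Bayes' theorem for the disease-testing example.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Denominator via the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
posterior = p_pos_given_disease * p_disease / p_pos

print(round(posterior, 3))  # 0.161: only about a 16% chance of disease
```

Try raising `p_disease` to 0.10: the posterior jumps to about 0.68, which shows how strongly the prior drives the answer.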
5. Random Variables
In practice, we rarely work directly with sample outcomes $\omega$. Instead, we assign numerical values to outcomes. A random variable is the function that performs this assignment.
A random variable $X$ is a function
$$X : \Omega \to \mathbb{R}.$$
It maps each outcome in the sample space to a real number.
Intuition: Think of a random variable as a "measurement" or "summary" of the random experiment. The experiment produces some outcome $\omega$; the random variable extracts a number from it.
Toss a coin 10 times. The outcome $\omega$ is a sequence like $(H,T,T,T,H,H,T,T,T,H)$. Let $X$ = "number of heads." Then $X(\omega) = 4$ for this particular outcome. The random variable collapses a complex outcome (a length-10 sequence) into a single informative number.
Roll two dice. The sample space has 36 outcomes: $\Omega = \{(i,j): i,j \in \{1,\ldots,6\}\}$. Define $X = i + j$ (the sum). Now $X$ takes values in $\{2, 3, \ldots, 12\}$. For instance, $X((3,5)) = 8$.
The lifetime of a lightbulb is a random variable $X \geq 0$. Here $\Omega$ is abstract (all possible "states of the world" that determine how long the bulb lasts), but we care about the numerical output: how many hours until failure.
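The two-dice example can be made concrete: apply the random variable $X(\omega) = i + j$ to every outcome and count how the 36 outcomes distribute over the values $2, \ldots, 12$. A minimal sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# X = sum of two dice: apply the random variable to each of the 36 outcomes,
# then count how often each value appears.
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

print(pmf[8])             # 5/36: the outcomes (2,6),(3,5),(4,4),(5,3),(6,2)
print(sum(pmf.values()))  # 1, as any PMF must
```

This is exactly the passage from sample space to distribution: the PMF of the next section is just these counts divided by 36.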
6. Cumulative Distribution Function (CDF)
Given a random variable $X$, how do we fully describe its "randomness"? The CDF is the most general answer — it works for discrete, continuous, and mixed random variables alike.
The CDF of a random variable $X$ is
$$F_X(x) = P(X \leq x), \qquad x \in \mathbb{R}.$$
It answers: "what is the probability that $X$ takes a value at most $x$?"
- Non-decreasing: If $a < b$, then $F_X(a) \leq F_X(b)$. (More values are $\leq b$ than $\leq a$.)
- Right-continuous: $\lim_{x \to a^+} F_X(x) = F_X(a)$.
- Boundary values: $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.
Intuition: Imagine sliding a vertical line from left to right along the number line. $F_X(x)$ tells you how much of the "probability mass" you've swept up so far. It starts at 0, ends at 1, and never decreases.
7. PMF and PDF — Two Flavors of Distribution
The CDF is universal, but for calculations it's often easier to work with either a probability mass function (for discrete RVs) or a probability density function (for continuous RVs).
Discrete: Probability Mass Function (PMF)
If $X$ takes values in a countable set $\{x_1, x_2, \ldots\}$, its PMF is
$$f_X(x) = P(X = x).$$
The PMF tells you the probability of each specific value. It satisfies $\sum_x f_X(x) = 1$.
The CDF of a discrete RV is a staircase function — it jumps at each value $x_i$ by an amount equal to $f_X(x_i)$:
$$F_X(x) = \sum_{x_i \leq x} f_X(x_i).$$
Toss a fair coin 3 times. Let $X$ = number of heads. Then $X \sim \text{Binomial}(3, 1/2)$ with $f_X(x) = \binom{3}{x}\left(\tfrac{1}{2}\right)^3$:
| $x$ | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| $f_X(x)$ | $1/8$ | $3/8$ | $3/8$ | $1/8$ |
| $F_X(x)$ | $1/8$ | $4/8$ | $7/8$ | $1$ |
The CDF jumps at $x = 0, 1, 2, 3$ and is flat between them.
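The table can be reproduced directly from the binomial PMF formula, with the CDF as a running sum. A sketch using exact fractions:

```python
from fractions import Fraction
from math import comb

n, p = 3, Fraction(1, 2)

# PMF of Binomial(3, 1/2): f(x) = C(3, x) * (1/2)^x * (1/2)^(3-x)
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

# The CDF is the running sum of the PMF: a staircase with jumps at 0, 1, 2, 3.
cdf, running = {}, Fraction(0)
for x in range(n + 1):
    running += pmf[x]
    cdf[x] = running

print([str(pmf[x]) for x in range(4)])  # ['1/8', '3/8', '3/8', '1/8']
print([str(cdf[x]) for x in range(4)])  # ['1/8', '1/2', '7/8', '1']
```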
Continuous: Probability Density Function (PDF)
If the CDF $F_X$ can be written as an integral of a non-negative function $f_X$,
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt,$$
then $f_X$ is the probability density function. It satisfies $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$, and wherever $F_X$ is differentiable, $f_X(x) = F_X'(x)$.
For continuous RVs, $P(X = x) = 0$ for any single point $x$. The PDF $f_X(x)$ is not a probability — it's a density. To get a probability, you must integrate over an interval: $P(a < X \leq b) = \int_a^b f_X(x)\,dx$. Think of $f_X(x)\,dx$ as the probability of $X$ falling in a tiny interval of width $dx$ around $x$.
$X \sim \text{Uniform}(0,1)$ has the simplest continuous distribution:
$$f_X(x) = \begin{cases} 1 & 0 \leq x \leq 1 \\ 0 & \text{otherwise,} \end{cases} \qquad F_X(x) = x \text{ for } x \in [0,1].$$
Every sub-interval of $[0,1]$ of the same length has the same probability. For example, $P(0.2 < X < 0.5) = 0.5 - 0.2 = 0.3$.
Lightbulb lifetime might follow $X \sim \text{Exp}(\lambda)$ with $f_X(x) = \lambda e^{-\lambda x}$ for $x \geq 0$. If $\lambda = 1$:
$$P(X > 2) = \int_2^{\infty} e^{-x}\,dx = e^{-2} \approx 0.135.$$
There's about a 13.5% chance the bulb lasts more than 2 units of time.
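The exact tail probability can be cross-checked by simulation, since the standard library ships an exponential sampler. A sketch:

```python
import math
import random

# Exact tail probability for Exp(1): P(X > 2) = e^{-2}.
exact = math.exp(-2)
print(round(exact, 4))  # 0.1353

# Monte Carlo cross-check using the standard library's expovariate sampler.
random.seed(0)
n = 200_000
hits = sum(1 for _ in range(n) if random.expovariate(1.0) > 2)
print(round(hits / n, 3))  # close to 0.135
```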
8. Moments — Summarizing a Distribution
A full distribution (CDF, PMF, or PDF) contains complete information about a random variable, but is often more detail than we need. Moments compress this information into a few key numbers.
Expectation (Mean)
The expected value (or mean) of $X$ is
$$E(X) = \sum_x x\, f_X(x) \;\; \text{(discrete)}, \qquad E(X) = \int_{-\infty}^{\infty} x\, f_X(x)\,dx \;\; \text{(continuous)}.$$
Intuition: $E(X)$ is the "center of mass" of the distribution — the long-run average if you repeated the experiment infinitely many times.
A crucial property is linearity: for any constants $a_1, \ldots, a_p$ and random variables $X_1, \ldots, X_p$:
$$E\!\left(\sum_{i=1}^{p} a_i X_i\right) = \sum_{i=1}^{p} a_i\, E(X_i).$$
This holds always, even when the $X_i$ are dependent.
Variance and Standard Deviation
The variance measures spread around the mean:
$$\text{Var}(X) = E\big[(X - E(X))^2\big] = E(X^2) - \big(E(X)\big)^2.$$
The standard deviation is $\sigma_X = \sqrt{\text{Var}(X)}$, which has the same units as $X$.
Let $X$ = result of rolling a fair die. Then:
$$E(X) = \frac{1+2+\cdots+6}{6} = \frac{7}{2}, \qquad \text{Var}(X) = E(X^2) - \big(E(X)\big)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92.$$
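The die's mean and variance follow directly from the discrete formulas; a sketch with exact fractions so no rounding hides the shortcut $\text{Var}(X) = E(X^2) - E(X)^2$:

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # each face equally likely

mean = sum(x * p for x in faces)              # E(X)
second_moment = sum(x**2 * p for x in faces)  # E(X^2)
variance = second_moment - mean**2            # Var(X) = E(X^2) - E(X)^2

print(mean)      # 7/2
print(variance)  # 35/12
```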
Covariance and Correlation
For two random variables $X$ and $Y$:
$$\text{Cov}(X,Y) = E\big[(X - E(X))(Y - E(Y))\big] = E(XY) - E(X)\,E(Y).$$
The correlation normalizes covariance to $[-1, 1]$:
$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X\, \sigma_Y}.$$
$\rho = +1$ means perfect positive linear relationship, $\rho = -1$ means perfect negative, $\rho = 0$ means no linear relationship.
If $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$, so $\text{Cov}(X,Y) = 0$. The converse is false — zero covariance does not imply independence (it only rules out linear dependence).
The Covariance Matrix
For a random vector $\mathbf{X} = (X_1, \ldots, X_k)^\top$, the mean vector is $\boldsymbol{\mu} = (E(X_1), \ldots, E(X_k))^\top$ and the covariance matrix is:
$$\Sigma = \text{Var}(\mathbf{X}) = E\big[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top\big].$$
The $(i,j)$-entry is $\text{Cov}(X_i, X_j)$, so the diagonal entries are variances and off-diagonal entries are covariances. A useful rule: if $\mathbf{Y} = A\mathbf{X}$ for a matrix $A$, then $\text{Var}(\mathbf{Y}) = A\,\Sigma_{\mathbf{X}}\,A^\top$.
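The transformation rule $\text{Var}(A\mathbf{X}) = A\,\Sigma\,A^\top$ can be checked by hand on a small example. The covariance matrix below is hypothetical, chosen only for illustration:

```python
# Check the rule Var(AX) = A Σ Aᵀ for a 2-vector X. The covariance matrix
# below is a hypothetical example: Var(X1) = 4, Var(X2) = 9, Cov(X1, X2) = 1.
sigma = [[4.0, 1.0],
         [1.0, 9.0]]
A = [[1.0, 1.0],   # Y1 = X1 + X2
     [1.0, -1.0]]  # Y2 = X1 - X2

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def transpose(M):
    return [list(row) for row in zip(*M)]

var_y = matmul(matmul(A, sigma), transpose(A))
print(var_y)
# Var(Y1) = 4 + 9 + 2*1 = 15, Var(Y2) = 4 + 9 - 2*1 = 11, Cov(Y1, Y2) = 4 - 9 = -5
```

Note how the covariance of the sum and difference, $\text{Var}(X_1) - \text{Var}(X_2)$, falls out of the matrix product automatically.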
9. Conditional Expectation and Variance
Just as we conditioned probabilities on events, we can condition expectations and variances on the value of another random variable. This is central to prediction and regression.
$E(X \mid Y = y)$ is the mean of the conditional distribution of $X$ given $Y = y$:
$$E(X \mid Y = y) = \sum_x x\, f_{X \mid Y}(x \mid y) \;\; \text{(discrete)}, \qquad E(X \mid Y = y) = \int x\, f_{X \mid Y}(x \mid y)\,dx \;\; \text{(continuous)}.$$
Viewed as a function of $Y$, $E(X \mid Y)$ is itself a random variable.
Suppose study hours $Y$ and test score $X$ are jointly distributed. $E(X \mid Y = 5)$ answers: "if a student studies 5 hours, what score do we expect on average?" As $Y$ varies, $E(X \mid Y)$ traces out the regression function — the best prediction of $X$ given $Y$.
Two Powerful Decomposition Laws
Law of total expectation:
$$E(X) = E\big(E(X \mid Y)\big).$$
Intuition: Subdivide the population by $Y$. Compute the average $X$ within each group. Then average those group averages (weighted by group size). You get back the overall average of $X$.
Law of total variance:
$$\text{Var}(X) = E\big[\text{Var}(X \mid Y)\big] + \text{Var}\big(E(X \mid Y)\big).$$
Total variance = average within-group variance + between-group variance.
$X$ = a person's height, $Y$ = their country. $E[\text{Var}(X \mid Y)]$ captures how much height varies within each country. $\text{Var}(E(X \mid Y))$ captures how much average heights differ between countries. Together they explain all the variation in height.
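The decomposition can be verified on a toy joint distribution. The numbers below are illustrative (two equally likely, equal-sized groups, which keeps the weighting trivial):

```python
from fractions import Fraction

# Toy joint distribution: two equally likely groups (Y = 0 and Y = 1),
# each containing two equally likely X values. Numbers are illustrative.
groups = {0: [0, 2], 1: [4, 6]}
half = Fraction(1, 2)

def mean(xs):
    return sum(Fraction(x) for x in xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((Fraction(x) - m) ** 2 for x in xs) / len(xs)

within = sum(half * var(xs) for xs in groups.values())  # E[Var(X | Y)]
between = var([mean(xs) for xs in groups.values()])     # Var(E(X | Y)), equal weights
total = var([x for xs in groups.values() for x in xs])  # Var(X), groups equal-sized

print(within, between, total)  # 1 4 5, and indeed 1 + 4 = 5
```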
10. Key Probability Inequalities
Often we can't compute exact probabilities, but we can bound them using just the mean or variance. These bounds are essential tools in theoretical statistics.
Markov's inequality: if $X \geq 0$ and $t > 0$:
$$P(X \geq t) \leq \frac{E(X)}{t}.$$
Intuition: A non-negative variable can't be "large" too often if its average is small. If the average income is €30k, at most $1/3$ of people earn ≥€90k.
Chebyshev's inequality: for any random variable $X$ with mean $\mu$ and variance $\sigma^2$, and any $t > 0$:
$$P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}.$$
Intuition: A low-variance distribution concentrates near its mean. Setting $t = k\sigma$: at most $1/k^2$ of the probability lies more than $k$ standard deviations from the mean.
If $E(X) = 100$ and $\text{Var}(X) = 25$ (so $\sigma = 5$), then:
$$P(|X - 100| \geq 15) \leq \frac{25}{15^2} = \frac{1}{9} \approx 0.11.$$
At most about 11% of the probability is more than 3 standard deviations from the mean — regardless of the distribution's shape.
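Because Chebyshev holds for every distribution with that mean and variance, it is often loose for any particular one. A sketch comparing the bound with the actual tail of a normal distribution with the same moments:

```python
import random

# Chebyshev gives P(|X - 100| >= 15) <= 25/225 for ANY distribution with
# mean 100 and variance 25. Compare with the actual tail of a normal.
mu, sigma = 100, 5
bound = sigma**2 / 15**2  # 1/9 ≈ 0.111

random.seed(42)
n = 100_000
tail = sum(1 for _ in range(n) if abs(random.gauss(mu, sigma) - mu) >= 15) / n

print(round(bound, 3))  # 0.111
print(tail)             # for a normal, the true 3-sigma tail is ~0.003, far below
assert tail <= bound    # the bound holds, as Chebyshev guarantees
```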
Jensen's inequality: if $g$ is a convex function, then $E\big(g(X)\big) \geq g\big(E(X)\big)$.
Memory aid: Take $g(x) = x^2$. Then $E(X^2) \geq (E(X))^2$, which is the same as saying $\text{Var}(X) \geq 0$. That's reassuring and easy to remember!
If $g$ is concave, the inequality reverses: $E(g(X)) \leq g(E(X))$.
11. Convergence of Random Sequences
In statistics, we often study what happens as the sample size $n \to \infty$. But "a sequence of random variables converges" can mean several different things. Here are the three main notions.
$X_n \xrightarrow{P} X$ if for every $\epsilon > 0$: $\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0$.
Intuition: For large $n$, $X_n$ is "close" to $X$ with high probability. There might be rare exceptions, but they become vanishingly unlikely.
$X_n \xrightarrow{d} X$ if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at every continuity point $x$ of $F_X$.
Intuition: The shape of the distribution of $X_n$ approaches the shape of the distribution of $X$. The individual random variables $X_n$ and $X$ need not be related — only their distributions must match in the limit.
$X_n \xrightarrow{L^r} X$ (for $r \geq 1$) if $\lim_{n \to \infty} E(|X_n - X|^r) = 0$.
When $r = 2$, this is called convergence in quadratic mean.
How They Relate
Convergence in $L^r$ implies convergence in probability, which in turn implies convergence in distribution:
$$X_n \xrightarrow{L^r} X \;\Longrightarrow\; X_n \xrightarrow{P} X \;\Longrightarrow\; X_n \xrightarrow{d} X.$$
The reverse implications fail in general. One useful exception: convergence in distribution to a constant implies convergence in probability to that constant.
12. The Big Theorems — LLN and CLT
These two results are the workhorses of statistics. They explain why sample averages behave so well and why the normal distribution appears everywhere.
Let $X_1, X_2, \ldots$ be i.i.d. with $E(X_i) = \mu$. Then the sample mean converges to $\mu$ in probability (the weak law of large numbers):
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mu.$$
Intuition: The sample average stabilizes around the true mean as you collect more data. This is why polls, experiments, and simulations work — with enough observations, the average converges to the quantity you're trying to measure.
A casino game has a 1% house edge. On any given bet, the house might lose. But averaged over millions of bets, the casino's profit per bet converges to 1% of the wager — the LLN guarantees it.
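The LLN is easy to watch in a simulation. A sketch tracking the running mean of fair-die rolls (the seed makes the run reproducible):

```python
import random

# LLN in action: the running mean of fair-die rolls settles near E(X) = 3.5.
random.seed(1)
rolls = [random.randint(1, 6) for _ in range(100_000)]

for n in (100, 1_000, 100_000):
    mean_n = sum(rolls[:n]) / n
    print(n, round(mean_n, 3))  # the gap to 3.5 shrinks as n grows
```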
Let $X_1, X_2, \ldots$ be i.i.d. with $E(X_i) = \mu$ and $\text{Var}(X_i) = \sigma^2 > 0$. Then:
$$\sqrt{n}\;\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} N(0, 1).$$
Intuition: No matter what distribution the individual $X_i$ come from (Bernoulli, exponential, Poisson, anything with finite variance), the standardized sample average is approximately normal for large $n$. This is why the bell curve appears so often in practice.
Roll a fair die $n = 100$ times and compute $\bar{X}_{100}$. Each $X_i$ is uniform on $\{1,\ldots,6\}$ — decidedly not normal. Yet the CLT tells us:
$$\bar{X}_{100} \approx N\!\left(3.5,\; \frac{35/12}{100}\right).$$
So $\bar{X}_{100}$ is approximately normal with mean $3.5$ and standard deviation $\sqrt{35/12}/10 \approx 0.17$. About 95% of the probability falls within $3.5 \pm 1.96 \times 0.17$, roughly the interval $(3.17,\, 3.83)$ — even though each individual die roll is far from normal.
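The normal approximation can be checked by simulating many samples of 100 rolls and standardizing each sample mean. A sketch (seeded for reproducibility):

```python
import math
import random

# CLT check: standardized means of n = 100 die rolls should be ~ N(0, 1),
# so about 95% of them should land inside [-1.96, 1.96].
random.seed(7)
mu = 3.5
sigma = math.sqrt(35 / 12)  # sd of a single roll, ≈ 1.708
reps, n = 10_000, 100

inside = 0
for _ in range(reps):
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - mu) / sigma  # standardized sample mean
    if abs(z) <= 1.96:
        inside += 1

print(round(inside / reps, 3))  # ≈ 0.95, as the normal approximation predicts
```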
Key Takeaways
- Sample space $\Omega$ defines what can happen; events are subsets we assign probabilities to; the probability measure $P$ follows three axioms from which everything else is derived.
- Independence means $P(A \cap B) = P(A)P(B)$ — knowing one event tells you nothing about the other. Disjoint events (with positive probability) are strongly dependent, not independent.
- Bayes' theorem reverses conditional probabilities and is the foundation of Bayesian reasoning — posterior $\propto$ likelihood $\times$ prior.
- Random variables map outcomes to numbers; their behavior is fully described by the CDF, and more conveniently by the PMF (discrete) or PDF (continuous).
- Moments (mean, variance, covariance) compress distributional information into key summaries; inequalities (Markov, Chebyshev, Jensen) let us bound probabilities using only these summaries.
- LLN: sample means converge to population means. CLT: sample means are approximately normal for large $n$. Together, these justify most of classical statistics.