Lecture 1: Sample Space, Random Variables, and Distributions

Course: Core Concepts in Statistical Learning

This lecture lays the probabilistic groundwork that everything in statistical learning builds upon. We start from the very beginning — what is randomness? — and build up to random variables, their distributions, and the key tools we need to describe and work with uncertain quantities. Think of this lecture as assembling a toolkit: sample spaces, events, probabilities, random variables, CDFs, and density/mass functions are the core instruments you'll reach for in every later topic.

1. The Sample Space

Before we can talk about probabilities, we need to define what can happen. The sample space $\Omega$ is simply the set of all possible outcomes of a random experiment. Every probabilistic statement we ever make is, at its core, a statement about subsets of $\Omega$.

Definition — Sample Space

The sample space $\Omega$ is the set of all possible outcomes of a random experiment. Individual elements $\omega \in \Omega$ are called sample outcomes (or realizations, or elementary events).

Sample spaces come in different "sizes":

Example — Coin Tossed Twice

Toss a coin twice. Each toss yields Heads (H) or Tails (T). The sample space is:

$$\Omega = \{(H,H),\;(H,T),\;(T,H),\;(T,T)\}$$

This is a finite sample space with $|\Omega| = 4$ outcomes.

Example — Rolling a Die

Roll a standard six-sided die once:

$$\Omega = \{1, 2, 3, 4, 5, 6\}$$

Again finite, with 6 outcomes.

Example — Continuous Measurement

Measure the temperature outside right now. In principle the result could be any real number, so $\Omega = \mathbb{R}$ (uncountable). In practice you might restrict to $\Omega = [-50, 60]$ degrees Celsius, but mathematically we treat it as a continuum.
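Finite sample spaces like these can be enumerated directly. A minimal Python sketch (variable names are illustrative):

```python
from itertools import product

# Sample space for two coin tosses: all ordered pairs of H/T
omega = set(product("HT", repeat=2))

print(sorted(omega))   # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
print(len(omega))      # 4
```

The die's sample space is simply `set(range(1, 7))`. The continuous case cannot be enumerated at all, which is one reason we will need densities later.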

2. Events

An event is any subset of $\Omega$. Intuitively, an event is a "question" you can ask about the outcome of the experiment — and the answer is either "yes, it happened" or "no, it didn't."

Definition — Event

A subset $A \subseteq \Omega$ is called an event. We say the event $A$ occurred if the observed outcome $\omega$ belongs to $A$.

Since events are sets, all the usual set operations apply:

Example — Events in the Double Coin Toss

Let $A$ = "the first toss is H". Then:

$$A = \{(H,H),\;(H,T)\}, \qquad A^c = \{(T,H),\;(T,T)\}$$

Let $B$ = "at least one toss is T". Then $B = \{(H,T),(T,H),(T,T)\}$ and $A \cap B = \{(H,T)\}$ — the first toss was H and at least one toss was T.
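Because events are just sets, set operations in code mirror the math exactly. A short sketch continuing the double-coin-toss example:

```python
from itertools import product

omega = set(product("HT", repeat=2))

A = {w for w in omega if w[0] == "H"}   # first toss is H
B = {w for w in omega if "T" in w}      # at least one toss is T

A_complement = omega - A   # {('T', 'H'), ('T', 'T')}
A_and_B = A & B            # {('H', 'T')}
```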

Example — Shrinking Intervals

Take $\Omega = \mathbb{R}$ and define $A_i = [0, \tfrac{1}{i})$ for $i = 1, 2, 3, \ldots$. Each subsequent interval is smaller: $A_1 = [0,1)$, $A_2 = [0,\tfrac{1}{2})$, $A_3 = [0,\tfrac{1}{3})$, and so on. Then:

$$\bigcup_{i=1}^{\infty} A_i = [0, 1), \qquad \bigcap_{i=1}^{\infty} A_i = \{0\}$$

The union is the largest interval $A_1$. The intersection shrinks to the single point that belongs to every interval — namely $0$.

3. Probability Measure

We now want to assign a number to each event that captures "how likely" it is. A probability measure $P$ does exactly this — it's a function that takes an event and returns a number between 0 and 1, following three intuitive rules.

Definition — Probability Measure (Kolmogorov Axioms)

A probability measure $P$ is a function on events satisfying:

  1. Non-negativity: $P(A) \geq 0$ for every event $A$.
  2. Normalization: $P(\Omega) = 1$ — something always happens.
  3. $\sigma$-Additivity: If $A_1, A_2, \ldots$ are disjoint events (no two can happen simultaneously), then $P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.

Intuition: Axiom 3 is the powerful one. It says that if events don't overlap, the probability of "any of them happening" is just the sum of their individual probabilities. This is the foundation for most probability calculations.

Properties That Follow

From these three axioms alone, we can derive several useful facts:

Key Properties
  • Impossible event: $P(\varnothing) = 0$.
  • Complement rule: $P(A^c) = 1 - P(A)$. (Often the easiest way to compute a probability — find the probability of the opposite!)
  • Bound: $P(A) \leq 1$ for any event $A$.
  • Monotonicity: If $A \subseteq B$, then $P(A) \leq P(B)$. (A more specific event is at most as likely as a broader one.)
  • Inclusion-Exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Example — Complement Rule in Practice

Roll a fair die three times. What is the probability of getting at least one 6? Directly counting the outcomes with at least one 6 is messy. Instead, use the complement:

$$P(\text{at least one 6}) = 1 - P(\text{no 6 in 3 rolls}) = 1 - \left(\frac{5}{6}\right)^3 = 1 - \frac{125}{216} \approx 0.421$$
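The same answer can be checked both exactly and by simulation. A quick sketch (the seed is arbitrary):

```python
import random
from fractions import Fraction

# Exact, via the complement rule
p_exact = 1 - Fraction(5, 6) ** 3   # 91/216 ≈ 0.421

# Monte Carlo check: repeat the three-roll experiment many times
random.seed(0)
n = 100_000
hits = sum(
    any(random.randint(1, 6) == 6 for _ in range(3))
    for _ in range(n)
)
print(float(p_exact), hits / n)   # both near 0.421
```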

Example — Inclusion-Exclusion

In a class of 30 students, 18 study Python, 12 study R, and 5 study both. What fraction study at least one?

$$P(\text{Python} \cup \text{R}) = \frac{18}{30} + \frac{12}{30} - \frac{5}{30} = \frac{25}{30} \approx 0.833$$

Without subtracting the overlap, you'd double-count the 5 students who do both.

4. Independence and Conditional Probability

Independence

Intuitively, two events are independent if knowing that one occurred tells you nothing about whether the other occurred. The formal definition translates this into a clean multiplicative rule.

Definition — Independence of Two Events

Events $A$ and $B$ are independent if:

$$P(A \cap B) = P(A) \times P(B)$$

Example — Two Fair Coin Tosses

$A$ = "1st toss is H" and $B$ = "2nd toss is H". Each has probability $\frac{1}{2}$. Their intersection is $\{(H,H)\}$ with probability $\frac{1}{4}$.

$$P(A \cap B) = \frac{1}{4} = \frac{1}{2} \times \frac{1}{2} = P(A) \times P(B) \quad \checkmark$$

So the two tosses are independent — knowing the first result doesn't help predict the second.
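Independence can be verified by brute-force enumeration of the sample space. A minimal sketch:

```python
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))
P = Fraction(1, len(omega))   # each of the 4 outcomes is equally likely

A = [w for w in omega if w[0] == "H"]            # first toss H
B = [w for w in omega if w[1] == "H"]            # second toss H
AB = [w for w in omega if w[0] == w[1] == "H"]   # both H

# P(A ∩ B) = P(A) P(B)  ⇔  1/4 = 1/2 · 1/2
assert len(AB) * P == (len(A) * P) * (len(B) * P)
```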

Important — Disjoint ≠ Independent

A common misconception is that disjoint events are independent. In fact, disjoint events with positive probability are strongly dependent: if $A \cap B = \varnothing$, then $P(A \cap B) = 0$ while $P(A)\times P(B) > 0$ — so the independence equation fails. Knowing $A$ occurred immediately tells you $B$ did not!

For more than two events, independence must hold for every subcollection:

Definition — Mutual Independence

A collection $\{A_i, i \in I\}$ is mutually independent if for every finite subset $J \subseteq I$:

$$P\!\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j)$$

Conditional Probability

When events are not independent, learning that one occurred changes your belief about the other. Conditional probability quantifies this updated belief.

Definition — Conditional Probability

For events $A$ and $B$ with $P(B) > 0$:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Intuition: Once we know $B$ happened, our "universe" shrinks from $\Omega$ to $B$. We rescale so that $B$ has total probability 1, and then ask what fraction of $B$ is also in $A$.

Example — Conditional Probability with a Die

Roll a fair die. Let $A$ = "result is $\leq 3$" and $B$ = "result is even" = $\{2,4,6\}$.

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{2\})}{P(\{2,4,6\})} = \frac{1/6}{3/6} = \frac{1}{3}$$

Without any information, $P(A) = 1/2$. But once you know the result is even, only one of the three even numbers ($2$) is $\leq 3$, so the probability drops to $1/3$.
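When all outcomes are equally likely, the definition translates directly into code. A sketch:

```python
from fractions import Fraction

omega = set(range(1, 7))
A = {w for w in omega if w <= 3}       # result ≤ 3
B = {w for w in omega if w % 2 == 0}   # result even

def prob(event):
    return Fraction(len(event), len(omega))

p_A_given_B = prob(A & B) / prob(B)
print(p_A_given_B)   # 1/3
```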

Link Between Independence and Conditioning

$A$ and $B$ are independent (with $P(B)>0$) if and only if $P(A\mid B) = P(A)$. In words: learning $B$ happened doesn't change your belief about $A$ — that's exactly what "independent" means.

Bayes' Theorem

Bayes' theorem lets you "reverse" a conditional probability. If you know $P(B \mid A_i)$ (how likely is the evidence given a cause), it tells you $P(A_i \mid B)$ (how likely is the cause given the evidence).

Theorem — Bayes' Theorem

Let $A_1, \ldots, A_k$ be a partition of $\Omega$ (mutually exclusive, exhaustive) with $P(A_i)>0$ for all $i$. For any event $B$ with $P(B)>0$:

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j)\,P(A_j)}$$

$P(A_i)$ is the prior (belief before seeing evidence). $P(A_i \mid B)$ is the posterior (updated belief after seeing $B$).

Example — Medical Test

A disease affects 1% of a population. A test is 95% accurate: $P(\text{positive} \mid \text{disease}) = 0.95$ and $P(\text{positive} \mid \text{no disease}) = 0.05$ (5% false positive rate). You test positive — what's the probability you're actually sick?

$$P(\text{disease} \mid \text{+}) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.0095 + 0.0495} \approx 0.161$$

Despite the test being "95% accurate," there's only about a 16% chance you actually have the disease! The low base rate (1%) means most positive results are false alarms. This is one of the most important and counter-intuitive consequences of Bayes' theorem.

5. Random Variables

In practice, we rarely work directly with sample outcomes $\omega$. Instead, we assign numerical values to outcomes. A random variable is the function that performs this assignment.

Definition — Random Variable

A random variable $X$ is a function:

$$X : \Omega \to \mathbb{R}, \qquad \omega \mapsto X(\omega)$$

It maps each outcome in the sample space to a real number.

Intuition: Think of a random variable as a "measurement" or "summary" of the random experiment. The experiment produces some outcome $\omega$; the random variable extracts a number from it.

Example — Counting Heads

Toss a coin 10 times. The outcome $\omega$ is a sequence like $(H,T,T,T,H,H,T,T,T,H)$. Let $X$ = "number of heads." Then $X(\omega) = 4$ for this particular outcome. The random variable collapses a complex outcome (a length-10 sequence) into a single informative number.

Example — Sum of Two Dice

Roll two dice. The sample space has 36 outcomes: $\Omega = \{(i,j): i,j \in \{1,\ldots,6\}\}$. Define $X = i + j$ (the sum). Now $X$ takes values in $\{2, 3, \ldots, 12\}$. For instance, $X((3,5)) = 8$.

Example — Continuous Random Variable

The lifetime of a lightbulb is a random variable $X \geq 0$. Here $\Omega$ is abstract (all possible "states of the world" that determine how long the bulb lasts), but we care about the numerical output: how many hours until failure.

6. Cumulative Distribution Function (CDF)

Given a random variable $X$, how do we fully describe its "randomness"? The CDF is the most general answer — it works for discrete, continuous, and mixed random variables alike.

Definition — Cumulative Distribution Function

The CDF of a random variable $X$ is:

$$F_X(x) = P(X \leq x), \qquad x \in \mathbb{R}$$

It answers: "what is the probability that $X$ takes a value at most $x$?"

Properties of Any CDF
  • Non-decreasing: If $a < b$, then $F_X(a) \leq F_X(b)$. (More values are $\leq b$ than $\leq a$.)
  • Right-continuous: $\lim_{x \to a^+} F_X(x) = F_X(a)$.
  • Boundary values: $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.

Intuition: Imagine sliding a vertical line from left to right along the number line. $F_X(x)$ tells you how much of the "probability mass" you've swept up so far. It starts at 0, ends at 1, and never decreases.

Figure 1: A typical continuous CDF: a smooth, S-shaped curve rising from 0 to 1.

7. PMF and PDF — Two Flavors of Distribution

The CDF is universal, but for calculations it's often easier to work with either a probability mass function (for discrete RVs) or a probability density function (for continuous RVs).

Discrete: Probability Mass Function (PMF)

Definition — PMF

If $X$ takes values in a countable set $\{x_1, x_2, \ldots\}$, its PMF is:

$$f_X(x) = P(X = x)$$

The PMF tells you the probability of each specific value. It satisfies $\sum_x f_X(x) = 1$.

The CDF of a discrete RV is a staircase function — it jumps at each value $x_i$ by an amount equal to $f_X(x_i)$:

$$F_X(x) = \sum_{x_i \leq x} f_X(x_i)$$

Example — Binomial(3, 1/2)

Toss a fair coin 3 times. Let $X$ = number of heads. Then $X \sim \text{Binomial}(3, 1/2)$ and:

$$f_X(x) = \binom{3}{x}\left(\frac{1}{2}\right)^3 = \frac{1}{8}\binom{3}{x}, \quad x \in \{0,1,2,3\}$$

| $x$        | $0$   | $1$   | $2$   | $3$   |
|------------|-------|-------|-------|-------|
| $f_X(x)$   | $1/8$ | $3/8$ | $3/8$ | $1/8$ |
| $F_X(x)$   | $1/8$ | $4/8$ | $7/8$ | $1$   |

The CDF jumps at $x = 0, 1, 2, 3$ and is flat between them.
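The table can be reproduced exactly with `math.comb` and exact fractions. A sketch:

```python
from math import comb
from fractions import Fraction

def binom_pmf(x, n=3, p=Fraction(1, 2)):
    """PMF of Binomial(n, p) at x."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

pmf = [binom_pmf(x) for x in range(4)]
cdf = [sum(pmf[: x + 1]) for x in range(4)]   # running sum of the PMF

print([str(f) for f in pmf])   # ['1/8', '3/8', '3/8', '1/8']
print([str(f) for f in cdf])   # ['1/8', '1/2', '7/8', '1']
```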

Continuous: Probability Density Function (PDF)

Definition — PDF

If the CDF $F_X$ can be written as an integral of a non-negative function $f_X$:

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$

then $f_X$ is the probability density function. It satisfies $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.

Important — Density ≠ Probability

For continuous RVs, $P(X = x) = 0$ for any single point $x$. The PDF $f_X(x)$ is not a probability — it's a density. To get a probability, you must integrate over an interval: $P(a < X \leq b) = \int_a^b f_X(x)\,dx$. Think of $f_X(x)\,dx$ as the probability of $X$ falling in a tiny interval of width $dx$ around $x$.

Example — Uniform(0,1)

$X \sim \text{Uniform}(0,1)$ has the simplest continuous distribution:

$$f_X(x) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \\ 0 & \text{otherwise}\end{cases} \qquad F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \leq x < 1 \\ 1 & \text{if } x \geq 1\end{cases}$$

Every sub-interval of $[0,1]$ of the same length has the same probability. For example, $P(0.2 < X < 0.5) = 0.5 - 0.2 = 0.3$.

Example — Exponential Distribution

Lightbulb lifetime might follow $X \sim \text{Exp}(\lambda)$ with $f_X(x) = \lambda e^{-\lambda x}$ for $x \geq 0$. If $\lambda = 1$:

$$P(X > 2) = \int_2^{\infty} e^{-x}\,dx = e^{-2} \approx 0.135$$

There's about a 13.5% chance the bulb lasts more than 2 units of time.
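A quick numerical check of this tail probability (the seed is arbitrary):

```python
import math
import random

lam = 1.0
p_exact = math.exp(-lam * 2)   # P(X > 2) = e^{-2} ≈ 0.135

# Monte Carlo check using the standard library's exponential sampler
random.seed(1)
n = 100_000
p_sim = sum(random.expovariate(lam) > 2 for _ in range(n)) / n
print(round(p_exact, 3), round(p_sim, 3))
```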

8. Moments — Summarizing a Distribution

A full distribution (CDF, PMF, or PDF) contains complete information about a random variable, but is often more detail than we need. Moments compress this information into a few key numbers.

Expectation (Mean)

Definition — Expected Value

The expected value (or mean) of $X$ is:

$$E(X) = \int_{-\infty}^{\infty} x\,dF_X(x) = \begin{cases} \displaystyle\sum_x x\,f_X(x) & \text{(discrete)} \\[6pt] \displaystyle\int_{-\infty}^{\infty} x\,f_X(x)\,dx & \text{(continuous)}\end{cases}$$

Intuition: $E(X)$ is the "center of mass" of the distribution — the long-run average if you repeated the experiment infinitely many times.

A crucial property is linearity: for any constants $a_1, \ldots, a_p$ and random variables $X_1, \ldots, X_p$:

$$E\!\left(\sum_{i=1}^{p} a_i X_i\right) = \sum_{i=1}^{p} a_i\,E(X_i)$$

This holds always, even when the $X_i$ are dependent.

Variance and Standard Deviation

Definition — Variance

The variance measures spread around the mean:

$$\text{Var}(X) = E\!\left[(X - E(X))^2\right] = E(X^2) - [E(X)]^2$$

The standard deviation is $\sigma_X = \sqrt{\text{Var}(X)}$, which has the same units as $X$.

Example — Fair Die

Let $X$ = result of rolling a fair die. Then:

$$E(X) = \frac{1+2+3+4+5+6}{6} = 3.5$$
$$E(X^2) = \frac{1+4+9+16+25+36}{6} = \frac{91}{6} \approx 15.17$$
$$\text{Var}(X) = \frac{91}{6} - 3.5^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92$$
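These moments can be computed exactly with fractions. A sketch:

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)   # fair die

mean = sum(x * p for x in faces)              # 7/2 = 3.5
second_moment = sum(x**2 * p for x in faces)  # 91/6
variance = second_moment - mean**2            # 35/12

print(mean, second_moment, variance)   # 7/2 91/6 35/12
```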

Covariance and Correlation

Definition — Covariance and Correlation

For two random variables $X$ and $Y$:

$$\text{Cov}(X,Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y)$$

The correlation normalizes covariance to $[-1, 1]$:

$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \,\sigma_Y}$$

$\rho = +1$ means perfect positive linear relationship, $\rho = -1$ means perfect negative, $\rho = 0$ means no linear relationship.

Key Fact — Independence Implies Zero Covariance

If $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$, so $\text{Cov}(X,Y) = 0$. The converse is false — zero covariance does not imply independence (it only rules out linear dependence).

The Covariance Matrix

For a random vector $\mathbf{X} = (X_1, \ldots, X_k)^\top$, the mean vector is $\boldsymbol{\mu} = (E(X_1), \ldots, E(X_k))^\top$ and the covariance matrix is:

$$\Sigma_{\mathbf{X}} = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top]$$

The $(i,j)$-entry is $\text{Cov}(X_i, X_j)$, so the diagonal entries are variances and off-diagonal entries are covariances. A useful rule: if $\mathbf{Y} = A\mathbf{X}$ for a matrix $A$, then $\text{Var}(\mathbf{Y}) = A\,\Sigma_{\mathbf{X}}\,A^\top$.
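Both the covariance-matrix definition and the linear rule can be checked exactly on a toy joint distribution. A pure-Python sketch (the distribution is made up for illustration):

```python
from fractions import Fraction

# Toy joint distribution for (X1, X2): value pair -> probability
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 4), (1, 2): Fraction(1, 4),
}

def E(g):
    """Expectation of g(x) under the joint distribution."""
    return sum(g(x) * p for x, p in joint.items())

mu = [E(lambda x, i=i: x[i]) for i in range(2)]
Sigma = [[E(lambda x, i=i, j=j: (x[i] - mu[i]) * (x[j] - mu[j]))
          for j in range(2)] for i in range(2)]

# Check the linear rule Var(a^T X) = a^T Sigma a for a = (1, 1):
# the quadratic form then reduces to the sum of all entries of Sigma
var_sum = E(lambda x: (x[0] + x[1] - mu[0] - mu[1]) ** 2)
quad = sum(Sigma[i][j] for i in range(2) for j in range(2))
assert var_sum == quad
```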

9. Conditional Expectation and Variance

Just as we conditioned probabilities on events, we can condition expectations and variances on the value of another random variable. This is central to prediction and regression.

Definition — Conditional Expectation

$E(X \mid Y = y)$ is the mean of the conditional distribution of $X$ given $Y = y$:

$$E(X \mid Y = y) = \begin{cases} \displaystyle\sum_x x\,f_{X|Y}(x \mid y) & \text{(discrete)} \\[6pt] \displaystyle\int x\,f_{X|Y}(x \mid y)\,dx & \text{(continuous)}\end{cases}$$

Viewed as a function of $Y$, $E(X \mid Y)$ is itself a random variable.

Example — Predicting Test Scores

Suppose study hours $Y$ and test score $X$ are jointly distributed. $E(X \mid Y = 5)$ answers: "if a student studies 5 hours, what score do we expect on average?" As $Y$ varies, $E(X \mid Y)$ traces out the regression function — the best prediction of $X$ given $Y$.

Two Powerful Decomposition Laws

Theorem — Tower Rule (Law of Iterated Expectations)
$$E\big[E(X \mid Y)\big] = E(X)$$

Intuition: Subdivide the population by $Y$. Compute the average $X$ within each group. Then average those group averages (weighted by group size). You get back the overall average of $X$.

Theorem — Law of Total Variance
$$\text{Var}(X) = E[\text{Var}(X \mid Y)] + \text{Var}(E(X \mid Y))$$

Total variance = average within-group variance + between-group variance.

Example — Heights by Country

$X$ = a person's height, $Y$ = their country. $E[\text{Var}(X \mid Y)]$ captures how much height varies within each country. $\text{Var}(E(X \mid Y))$ captures how much average heights differ between countries. Together they explain all the variation in height.
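Both laws can be verified exactly on a small discrete joint distribution (the numbers are made up for illustration):

```python
from fractions import Fraction

# Toy joint pmf for (X, Y): Y labels the group, X is the measurement
joint = {
    (0, "a"): Fraction(1, 4), (2, "a"): Fraction(1, 4),
    (1, "b"): Fraction(1, 4), (5, "b"): Fraction(1, 4),
}

EX = sum(x * p for (x, y), p in joint.items())
VarX = sum(x * x * p for (x, y), p in joint.items()) - EX**2

groups = {"a", "b"}
pY = {g: sum(p for (x, y), p in joint.items() if y == g) for g in groups}
E_given = {g: sum(x * p for (x, y), p in joint.items() if y == g) / pY[g]
           for g in groups}
Var_given = {g: sum(x * x * p for (x, y), p in joint.items() if y == g) / pY[g]
             - E_given[g] ** 2
             for g in groups}

# Tower rule: averaging the group means recovers E(X)
assert sum(E_given[g] * pY[g] for g in groups) == EX

# Law of total variance: within-group + between-group = total
within = sum(Var_given[g] * pY[g] for g in groups)
between = sum((E_given[g] - EX) ** 2 * pY[g] for g in groups)
assert within + between == VarX
```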

10. Key Probability Inequalities

Often we can't compute exact probabilities, but we can bound them using just the mean or variance. These bounds are essential tools in theoretical statistics.

Markov's Inequality

If $X \geq 0$ and $t > 0$:

$$P(X \geq t) \leq \frac{E(X)}{t}$$

Intuition: A non-negative variable can't be "large" too often if its average is small. If the average income is €30k, at most $1/3$ of people earn ≥€90k.

Chebyshev's Inequality

For any random variable $X$ with mean $\mu$ and variance $\sigma^2$:

$$P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}$$

Intuition: A low-variance distribution concentrates near its mean. Setting $t = k\sigma$: at most $1/k^2$ of the probability lies more than $k$ standard deviations from the mean.

Example — Chebyshev Applied

If $E(X) = 100$ and $\text{Var}(X) = 25$ (so $\sigma = 5$), then:

$$P(|X - 100| \geq 15) \leq \frac{25}{225} = \frac{1}{9} \approx 0.111$$

At most about 11% of the probability is more than 3 standard deviations from the mean — regardless of the distribution's shape.
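The bound is easy to compare with an exact probability, for instance on the fair die from the moments section. A sketch:

```python
from fractions import Fraction

faces = range(1, 7)
mu = Fraction(7, 2)     # E(X) for a fair die
var = Fraction(35, 12)  # Var(X)

t = 2
p_exact = Fraction(sum(abs(x - mu) >= t for x in faces), 6)  # only faces 1 and 6
bound = var / t**2

print(p_exact, "<=", bound)   # 1/3 <= 35/48
assert p_exact <= bound
```

Chebyshev is loose here (1/3 vs. roughly 0.73), which is typical: it trades tightness for universality, since it holds for every distribution with this mean and variance.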

Jensen's Inequality

If $g$ is a convex function: $E(g(X)) \geq g(E(X))$.

Memory aid: Take $g(x) = x^2$. Then $E(X^2) \geq (E(X))^2$, which is the same as saying $\text{Var}(X) \geq 0$. That's reassuring and easy to remember!

If $g$ is concave, the inequality reverses: $E(g(X)) \leq g(E(X))$.

11. Convergence of Random Sequences

In statistics, we often study what happens as the sample size $n \to \infty$. But "a sequence of random variables converges" can mean several different things. Here are the three main notions.

Convergence in Probability

$X_n \xrightarrow{P} X$ if for every $\epsilon > 0$: $\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0$.

Intuition: For large $n$, $X_n$ is "close" to $X$ with high probability. There might be rare exceptions, but they become vanishingly unlikely.

Convergence in Distribution

$X_n \xrightarrow{d} X$ if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at every continuity point $x$ of $F_X$.

Intuition: The shape of the distribution of $X_n$ approaches the shape of the distribution of $X$. The individual random variables $X_n$ and $X$ need not be related — only their distributions must match in the limit.

Convergence in $L^r$

$X_n \xrightarrow{L^r} X$ (for $r \geq 1$) if $\lim_{n \to \infty} E(|X_n - X|^r) = 0$.

When $r = 2$, this is called convergence in quadratic mean.

How They Relate

$$X_n \xrightarrow{L^r} X \;\Longrightarrow\; X_n \xrightarrow{P} X \;\Longrightarrow\; X_n \xrightarrow{d} X$$

Figure 2: Hierarchy of convergence modes: stronger notions imply weaker ones. In the special case where the limit is a constant $c$, convergence in distribution also implies convergence in probability.

12. The Big Theorems — LLN and CLT

These two results are the workhorses of statistics. They explain why sample averages behave so well and why the normal distribution appears everywhere.

Theorem — Weak Law of Large Numbers (WLLN)

Let $X_1, X_2, \ldots$ be i.i.d. with $E(X_i) = \mu$. Then:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$$

Intuition: The sample average stabilizes around the true mean as you collect more data. This is why polls, experiments, and simulations work — with enough observations, the average converges to the quantity you're trying to measure.

Example — Casino Law of Large Numbers

A casino game has a 1% house edge. On any given bet, the house might lose. But averaged over millions of bets, the casino's profit per bet converges to 1% of the wager — the LLN guarantees it.
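A simulation makes the WLLN visible (the seed is arbitrary):

```python
import random

# Running average of fair-die rolls approaches mu = 3.5 as n grows
random.seed(42)
for n in (100, 10_000, 1_000_000):
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    print(n, round(xbar, 3))
```

The printed averages wander for small $n$ but settle ever closer to 3.5 as $n$ increases.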

Theorem — Central Limit Theorem (CLT)

Let $X_1, X_2, \ldots$ be i.i.d. with $E(X_i) = \mu$ and $\text{Var}(X_i) = \sigma^2 > 0$. Then:

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1)$$

Intuition: No matter what distribution the individual $X_i$ come from (Bernoulli, exponential, Poisson, anything with finite variance), the standardized sample average is approximately normal for large $n$. This is why the bell curve appears so often in practice.

Example — CLT in Action

Roll a fair die $n = 100$ times and compute $\bar{X}_{100}$. Each $X_i$ is uniform on $\{1,\ldots,6\}$ — decidedly not normal. Yet the CLT tells us:

$$\bar{X}_{100} \approx N\!\left(3.5, \;\frac{35/12}{100}\right) = N(3.5, \;0.0292)$$

So $\bar{X}_{100}$ is approximately normal with mean 3.5 and standard deviation $\sqrt{0.0292} \approx 0.17$. The probability that $\bar{X}_{100}$ falls between 3.0 and 4.0 (nearly 3 standard deviations on either side) exceeds 99% — even though each individual die roll is far from normal.
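Simulating many repetitions of the 100-roll experiment confirms this (the seed is arbitrary):

```python
import math
import random

# Distribution of the mean of 100 fair-die rolls, over many repetitions
random.seed(7)
reps = 20_000
means = [sum(random.randint(1, 6) for _ in range(100)) / 100 for _ in range(reps)]

m = sum(means) / reps
s = math.sqrt(sum((x - m) ** 2 for x in means) / reps)
print(round(m, 2), round(s, 3))   # mean near 3.5, sd near sqrt((35/12)/100) ≈ 0.171
```

A histogram of `means` would look strikingly bell-shaped, even though each single roll is uniform on $\{1,\ldots,6\}$.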

Key Takeaways