This lecture builds on the foundation of random variables and CDFs from Lecture 1. We begin with the quantile function — the "inverse" of the CDF — which lets us go from probabilities back to values. We then extend our view to multivariate distributions, covering joint, marginal, and conditional distributions, and the key concept of independence for random variables. Next we formalize moments (mean, variance, covariance) and their properties, including the powerful tools of conditional expectation and the law of total variance. We close with probability inequalities and the major convergence theorems (LLN, CLT, delta method) that underpin all of statistical inference.
1. The Quantile Function
Given a CDF $F$, the quantile function is defined as:

$$F^{-1}(q) = \inf\{x : F(x) \ge q\}, \qquad q \in (0, 1).$$
The value $F^{-1}(q)$ is called the $q$-quantile.
The CDF answers: "given a value $x$, what is the probability of being at most $x$?" The quantile function asks the reverse question: "given a probability $q$, what is the smallest value $x$ such that at least a fraction $q$ of the distribution lies at or below $x$?"
Think of it like a class exam. The CDF tells you "if you scored 75, you're in the 80th percentile." The quantile function tells you "to be in the 80th percentile, you need at least a score of 75."
When is the quantile just the usual inverse?
If $F$ is strictly increasing and continuous, then $F$ is a bijection (one-to-one and onto) from its support to $(0,1)$, and $F^{-1}(q)$ is simply the ordinary inverse function evaluated at $q$. The "inf" in the general definition handles cases where $F$ has flat regions or jumps (as with discrete distributions).
Let $X \sim \text{Uniform}(0,1)$, so $F(x) = x$ for $x \in [0,1]$. Then:

$$F^{-1}(q) = q, \qquad q \in (0, 1).$$
The 0.25-quantile is 0.25, the median ($q = 0.5$) is 0.5, and the 0.9-quantile is 0.9. This makes sense: for a uniform distribution, the quantiles are spread evenly across the interval.
Recall from Lecture 1 the $\text{Binomial}(3, 1/2)$ distribution with CDF:

$$F(x) = \begin{cases} 0 & x < 0 \\ 1/8 & 0 \le x < 1 \\ 1/2 & 1 \le x < 2 \\ 7/8 & 2 \le x < 3 \\ 1 & x \ge 3 \end{cases}$$
The median is $F^{-1}(1/2) = \inf\{x : F(x) \ge 1/2\}$. Looking at the CDF, $F(x)$ first reaches $1/2$ at $x = 1$. So the median is $\mathbf{1}$.
Similarly, $F^{-1}(0.9) = \inf\{x : F(x) \ge 0.9\} = 3$, since $F(2) = 7/8 = 0.875 < 0.9$ but $F(3) = 1 \ge 0.9$.
Let $X \sim \text{Exp}(\lambda)$ with $F(x) = 1 - e^{-\lambda x}$ for $x \ge 0$. Since $F$ is continuous and strictly increasing on $[0, \infty)$, we solve $F(x) = q$ directly:

$$F^{-1}(q) = -\frac{\ln(1 - q)}{\lambda}.$$
For $\lambda = 1$: the median is $-\ln(0.5) \approx 0.693$, and the 90th percentile is $-\ln(0.1) \approx 2.303$.
The median is the "middle value" of a distribution — half the probability mass lies below it, half above. For symmetric distributions (like the Normal), the median equals the mean. For skewed distributions, they can differ substantially.
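The inf-definition can be implemented directly by scanning a grid of candidate values. The following is a minimal Python sketch (function names are our own), checked against the Binomial(3, 1/2) and Exponential examples above:

```python
import math

def quantile(cdf, q, grid):
    """F^{-1}(q) = inf{x : F(x) >= q}, searched over an increasing grid."""
    for x in grid:
        if cdf(x) >= q:
            return x
    raise ValueError("q is never reached on this grid")

# Binomial(3, 1/2) CDF: F(0)=1/8, F(1)=1/2, F(2)=7/8, F(3)=1
def binom3_cdf(x):
    return sum(math.comb(3, k) for k in range(int(x) + 1)) / 8

print(quantile(binom3_cdf, 0.5, range(4)))  # median: 1
print(quantile(binom3_cdf, 0.9, range(4)))  # 0.9-quantile: 3

# Exponential(lam): closed-form inverse F^{-1}(q) = -ln(1-q)/lam
def exp_quantile(q, lam):
    return -math.log(1 - q) / lam

print(round(exp_quantile(0.5, 1.0), 3))  # median: 0.693
```

Note that for the discrete Binomial case the grid search is essential: no $x$ satisfies $F(x) = 0.9$ exactly, so the ordinary inverse does not exist and the infimum does the work.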
2. Multivariate Distributions
So far we've dealt with a single random variable $X$. In practice, we often observe multiple quantities simultaneously — for instance, a person's height and weight, or today's temperature and humidity. We need tools to describe their joint behavior.
$\mathbf{X} = (X_1, \ldots, X_k)^\top$ is a random vector in $\mathbb{R}^k$ if $X_1, \ldots, X_k$ are all real random variables defined on the same sample space $\Omega$.
2.1 Joint CDF, PMF, and PDF
The joint CDF of $\mathbf{X}$ is:

$$F_{\mathbf{X}}(\mathbf{x}) = P(X_1 \le x_1, \ldots, X_k \le x_k), \qquad \mathbf{x} = (x_1, \ldots, x_k)^\top \in \mathbb{R}^k.$$
Just as for a single random variable, we distinguish two important cases:
- Discrete: The joint pmf is $f_{\mathbf{X}}(\mathbf{x}) = P(X_1 = x_1, \ldots, X_k = x_k)$.
- Absolutely continuous: There exists a joint pdf $f_{\mathbf{X}}(\mathbf{x}) \ge 0$ such that $F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_k} \cdots \int_{-\infty}^{x_1} f_{\mathbf{X}}(t_1, \ldots, t_k)\,dt_1 \cdots dt_k$.
Roll two fair dice. Let $X_1$ = result of die 1 and $X_2$ = result of die 2. The joint pmf is:

$$f_{X_1, X_2}(x_1, x_2) = \frac{1}{36}, \qquad x_1, x_2 \in \{1, 2, \ldots, 6\}.$$
Every pair is equally likely. For example, $P(X_1 = 3, X_2 = 5) = 1/36$.
2.2 Marginal Distributions
The marginal distribution of a single component $X_i$ is obtained by "summing out" (discrete) or "integrating out" (continuous) all other variables from the joint distribution:

$$f_{X_i}(x_i) = \sum_{x_j,\ j \ne i} f_{\mathbf{X}}(\mathbf{x}) \quad \text{(discrete)}, \qquad f_{X_i}(x_i) = \int \cdots \int f_{\mathbf{X}}(\mathbf{x}) \prod_{j \ne i} dx_j \quad \text{(continuous)}.$$
Think of the joint distribution as a complete spreadsheet of data about every variable together. The marginal distribution is what you get when you focus on just one column and ignore the rest. The word "marginal" comes from the idea of writing row/column totals in the margins of a table.
Suppose $(X, Y)$ has joint pmf given by this table:
| | $Y=0$ | $Y=1$ | Marginal of $X$ |
|---|---|---|---|
| $X=0$ | 0.2 | 0.1 | 0.3 |
| $X=1$ | 0.3 | 0.4 | 0.7 |
| Marginal of $Y$ | 0.5 | 0.5 | 1.0 |
For instance, $f_X(1) = f(1,0) + f(1,1) = 0.3 + 0.4 = 0.7$. The marginals appear in the "margins" of the table — hence the name.
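Summing out the other variable is easy to do in code. Here is a short Python sketch, representing the joint pmf of Example 5 as a dict keyed by $(x, y)$ pairs (the representation is our choice, not part of the lecture):

```python
# Joint pmf of (X, Y) from the table above, keyed by (x, y)
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}

def marginal(joint, axis):
    """Sum out all coordinates except `axis` (0 for X, 1 for Y)."""
    m = {}
    for pair, p in joint.items():
        key = pair[axis]
        m[key] = m.get(key, 0.0) + p
    return m

fX = marginal(joint, 0)  # {0: 0.3, 1: 0.7}, up to float rounding
fY = marginal(joint, 1)  # {0: 0.5, 1: 0.5}
```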
2.3 Independence of Random Variables
Random variables $X_1, \ldots, X_k$ are mutually independent if for all events $A_1, \ldots, A_k$:

$$P(X_1 \in A_1, \ldots, X_k \in A_k) = \prod_{i=1}^{k} P(X_i \in A_i).$$

(Taking some $A_i = \mathbb{R}$ shows this covers every subset of the indices as well.)
$X_1, \ldots, X_k$ are mutually independent if and only if for all $\mathbf{x} \in \mathbb{R}^k$:

$$f_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^{k} f_{X_i}(x_i).$$
In words: the joint pmf/pdf factors into the product of the marginals.
Independence means knowing one variable tells you nothing about the others. If the joint distribution equals the product of the individual (marginal) distributions, then the variables carry no information about each other. This is the multivariate generalization of the event independence $P(A \cap B) = P(A)P(B)$ from Lecture 1.
Using Example 5: is $P(X=1, Y=0) = P(X=1) \cdot P(Y=0)$? We have $0.3 \ne 0.7 \times 0.5 = 0.35$. So $X$ and $Y$ are not independent.
If instead the table were $f(0,0)=0.15$, $f(0,1)=0.15$, $f(1,0)=0.35$, $f(1,1)=0.35$, then every cell would equal the product of its row and column marginals, and $X, Y$ would be independent.
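The cell-by-cell factorization check is mechanical, so it can be sketched in a few lines of Python (the helper name `is_independent` is ours). It distinguishes the dependent table of Example 5 from the independent variant just described:

```python
import math

# Example 5's table (dependent) and the modified table (independent)
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}
alt   = {(0, 0): 0.15, (0, 1): 0.15, (1, 0): 0.35, (1, 1): 0.35}

def is_independent(pmf):
    """Check f(x, y) == fX(x) * fY(y) for every cell of a 2-D pmf."""
    fX, fY = {}, {}
    for (x, y), p in pmf.items():
        fX[x] = fX.get(x, 0.0) + p
        fY[y] = fY.get(y, 0.0) + p
    return all(math.isclose(p, fX[x] * fY[y]) for (x, y), p in pmf.items())

print(is_independent(joint))  # False: 0.3 != 0.7 * 0.5
print(is_independent(alt))    # True: every cell factors
```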
2.4 Conditional Distributions
For a random pair $(X, Y)$, the conditional pmf/pdf of $X$ given $Y = y$ is:

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \qquad \text{wherever } f_Y(y) > 0.$$
This is just the definition of conditional probability applied to random variables. Once you observe that $Y = y$, you "zoom in" on the slice of the joint distribution where $Y = y$ and renormalize it so the probabilities add up to 1. The denominator $f_Y(y)$ does that renormalization.
From Example 5, what is $f_{X|Y}(x | Y=1)$?

$$f_{X|Y}(0 \mid 1) = \frac{0.1}{0.5} = 0.2, \qquad f_{X|Y}(1 \mid 1) = \frac{0.4}{0.5} = 0.8.$$
Given that $Y = 1$, there's an 80% chance that $X = 1$ and a 20% chance that $X = 0$. Notice the marginal was $P(X=1) = 0.7$, so observing $Y=1$ increased our belief that $X=1$ — they are positively associated.
3. Moments: Mean, Variance, Covariance
Moments are numerical summaries that capture key features of a distribution: its center, spread, and the relationships between variables.
3.1 Expectation (Mean)
The $k$-th moment of a random variable $X$ is:

$$E(X^k) = \sum_x x^k f_X(x) \quad \text{(discrete)}, \qquad E(X^k) = \int x^k f_X(x)\,dx \quad \text{(continuous)}.$$
The first moment $E(X)$ is the mean (or expected value) — the "center of mass" of the distribution.
For any constants $a_1, \ldots, a_p$ and random variables $X_1, \ldots, X_p$:

$$E\!\left(\sum_{i=1}^{p} a_i X_i\right) = \sum_{i=1}^{p} a_i\, E(X_i).$$
This holds always — no independence assumption needed. It's one of the most useful properties in probability.
Roll 100 dice. Let $S = X_1 + \cdots + X_{100}$. By linearity, $E(S) = 100 \cdot E(X_1) = 100 \cdot 3.5 = 350$. No need to enumerate all $6^{100}$ outcomes!
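A quick Monte Carlo sanity check (our own sketch, with an arbitrary seed and trial count) confirms linearity without any enumeration:

```python
import random

random.seed(0)
trials = 2000
# Average of S = X1 + ... + X100 over many simulated experiments
avg_S = sum(
    sum(random.randint(1, 6) for _ in range(100)) for _ in range(trials)
) / trials
print(round(avg_S, 1))  # close to E(S) = 100 * 3.5 = 350
```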
3.2 Variance and Standard Deviation
The variance of $X$ is the second central moment:

$$\text{Var}(X) = E\big[(X - E(X))^2\big] = E(X^2) - [E(X)]^2.$$
The standard deviation is $\sigma_X = \sqrt{\text{Var}(X)}$ and has the same units as $X$.
Variance measures how spread out a distribution is around its mean. A variance of zero means the random variable is actually a constant. The second formula $E(X^2) - [E(X)]^2$ is often easier to compute — it says: "the average of the squares minus the square of the average."
Let $X \sim \text{Bernoulli}(p)$, so $X \in \{0, 1\}$. Then $E(X) = p$, $E(X^2) = p$ (since $0^2=0$, $1^2=1$), and:

$$\text{Var}(X) = E(X^2) - [E(X)]^2 = p - p^2 = p(1-p).$$
Maximum variance occurs at $p = 1/2$ (most uncertainty), and variance is 0 at $p=0$ or $p=1$ (no randomness).
3.3 Covariance and Correlation
For two random variables $X$ and $Y$:

$$\text{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big] = E(XY) - E(X)E(Y), \qquad \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
Covariance tells you whether two variables tend to move together ($\text{Cov} > 0$), move in opposite directions ($\text{Cov} < 0$), or have no linear association ($\text{Cov} = 0$). Correlation is a normalized version on $[-1, 1]$ that is unit-free, making it easier to interpret.
Key properties to remember:
- $\text{Cov}(X, X) = \text{Var}(X)$.
- $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ — it's symmetric.
- If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$. (The converse is not true in general!)
- $\text{Var}\!\left(\sum a_i X_i\right) = \sum a_i^2 \text{Var}(X_i) + 2\sum_{i < j} a_i a_j \text{Cov}(X_i, X_j)$.
Let $X \sim \text{Uniform}\{-1, 0, 1\}$ and $Y = X^2$. Then $E(X) = 0$, so:

$$\text{Cov}(X, Y) = E(XY) - E(X)E(Y) = E(X^3) - 0 \cdot E(X^2) = 0,$$
since $X^3$ takes values $\{-1, 0, 1\}$ symmetrically. Yet $Y$ is completely determined by $X$! This shows zero covariance doesn't mean independence — it only captures linear relationships.
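The three expectations in this example are finite sums, so the claim can be verified by direct computation (a small Python sketch of our own):

```python
# X uniform on {-1, 0, 1}, Y = X^2: dependent but uncorrelated
support = [-1, 0, 1]
E_X  = sum(x for x in support) / 3          # 0
E_Y  = sum(x**2 for x in support) / 3       # 2/3
E_XY = sum(x * x**2 for x in support) / 3   # E(X^3) = 0
cov = E_XY - E_X * E_Y
print(cov)  # 0.0 -- yet Y is a deterministic function of X
```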
3.4 The Covariance Matrix and Linear Transformations
For $\mathbf{X} = (X_1, \ldots, X_k)^\top$:

$$\Sigma_{\mathbf{X}} = E\big[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top\big],$$
where $\boldsymbol{\mu} = E(\mathbf{X})$. The $(i,j)$-th entry of $\Sigma_{\mathbf{X}}$ is $\text{Cov}(X_i, X_j)$. Diagonal entries are the variances.
Let $\mathbf{Y} = A\mathbf{X}$ where $A$ is an $m \times k$ constant matrix. Then:

$$E(\mathbf{Y}) = A\,E(\mathbf{X}), \qquad \Sigma_{\mathbf{Y}} = A\,\Sigma_{\mathbf{X}} A^\top.$$
Let $S = X_1 + X_2$. This is $S = A\mathbf{X}$ with $A = (1 \;\; 1)$. Then:

$$\text{Var}(S) = A\,\Sigma_{\mathbf{X}} A^\top = \text{Var}(X_1) + \text{Var}(X_2) + 2\,\text{Cov}(X_1, X_2).$$
If $X_1, X_2$ are independent, $\text{Cov}=0$ and $\text{Var}(S) = \text{Var}(X_1) + \text{Var}(X_2)$.
4. Conditional Expectation and Variance
The conditional expectation of $X$ given $Y = y$ is the mean of the conditional distribution $f_{X|Y}(\cdot|y)$:

$$E(X \mid Y = y) = \sum_x x\, f_{X|Y}(x \mid y) \quad \text{(discrete)}, \qquad \int x\, f_{X|Y}(x \mid y)\,dx \quad \text{(continuous)}.$$
$E(X|Y)$ is a random variable — it takes value $E(X|Y=y)$ when $Y=y$.
Imagine you want to predict someone's income ($X$) and you know their education level ($Y$). $E(X|Y=y)$ is the best prediction of income for people with education level $y$ — it's the average income within that group. As $Y$ varies, this prediction changes, which is why $E(X|Y)$ is itself a random variable.
Using Example 5 (and Example 7), $E(X | Y = 1) = 0 \cdot 0.2 + 1 \cdot 0.8 = 0.8$, and $E(X | Y = 0) = 0 \cdot 0.4 + 1 \cdot 0.6 = 0.6$.
(Here $f_{X|Y}(0|0) = 0.2/0.5 = 0.4$ and $f_{X|Y}(1|0) = 0.3/0.5 = 0.6$.)
4.1 The Tower Rule (Law of Iterated Expectations)
$$E\big[E(X \mid Y)\big] = E(X).$$

Divide a population into groups based on $Y$. Compute the average of $X$ within each group. Then take the (weighted) average of those group averages. You get back the overall average of $X$. Think of it as: the average of the group averages is the grand average.
Continuing the previous example: $E(X|Y=0) = 0.6$ and $E(X|Y=1) = 0.8$. Since $P(Y=0) = 0.5$ and $P(Y=1) = 0.5$:

$$E(X) = 0.6 \times 0.5 + 0.8 \times 0.5 = 0.7,$$

which matches the marginal mean $E(X) = P(X = 1) = 0.7$.
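The same computation in Python, again using a dict for the joint pmf of Example 5 (the representation is our choice), shows $E(X|Y)$ as a function of $y$ and then averages it back to $E(X)$:

```python
# Joint pmf of (X, Y) from Example 5, and the marginal of Y
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}
fY = {0: 0.5, 1: 0.5}

# E(X | Y = y) = sum_x x * f(x, y) / fY(y)
E_X_given_Y = {
    y: sum(x * joint[(x, y)] for x in (0, 1)) / fY[y] for y in (0, 1)
}
print(E_X_given_Y)  # {0: 0.6, 1: 0.8}

# Tower rule: averaging the group means recovers E(X)
E_X = sum(E_X_given_Y[y] * fY[y] for y in (0, 1))
print(round(E_X, 2))  # 0.7, the marginal mean of X
```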
4.2 Law of Total Variance
$$\text{Var}(X) = E\big[\text{Var}(X \mid Y)\big] + \text{Var}\big(E(X \mid Y)\big).$$

The total variability in $X$ comes from two sources:
- $E[\text{Var}(X|Y)]$: the average variability within each group (how spread out incomes are among people with the same education level).
- $\text{Var}(E(X|Y))$: the variability between group means (how much the average income differs across education levels).
This is exactly the "within-group" + "between-group" variance decomposition familiar from ANOVA.
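The decomposition can be checked numerically on the table from Example 5 (a Python sketch of our own; `cond_mean_var` is a hypothetical helper name). Here $X$ is Bernoulli(0.7), so $\text{Var}(X) = 0.7 \times 0.3 = 0.21$:

```python
# Joint pmf of (X, Y) from Example 5
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}
fY = {0: 0.5, 1: 0.5}

def cond_mean_var(y):
    """Mean and variance of the conditional distribution of X given Y=y."""
    fx = {x: joint[(x, y)] / fY[y] for x in (0, 1)}
    m = sum(x * p for x, p in fx.items())
    v = sum((x - m) ** 2 * p for x, p in fx.items())
    return m, v

within  = sum(cond_mean_var(y)[1] * fY[y] for y in (0, 1))  # E[Var(X|Y)]
means   = {y: cond_mean_var(y)[0] for y in (0, 1)}
grand   = sum(means[y] * fY[y] for y in (0, 1))
between = sum((means[y] - grand) ** 2 * fY[y] for y in (0, 1))  # Var(E(X|Y))

print(within, between)  # 0.2 within-group + 0.01 between-group = 0.21
```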
5. Probability Inequalities
These inequalities are blunt but powerful tools: they give bounds on probabilities using only simple summaries (the mean, the variance). They are especially useful when you don't know the exact distribution.
Markov's inequality: if $X \ge 0$ and $t > 0$:

$$P(X \ge t) \le \frac{E(X)}{t}.$$
If a non-negative variable has a small mean, it can't be large too often. For instance, if the average commute time is 30 minutes, at most $30/120 = 25\%$ of commuters can have a commute $\ge 120$ minutes.
Chebyshev's inequality: if $X$ has mean $\mu$ and variance $\sigma^2$, then for any $t > 0$:

$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}.$$
Suppose exam scores have $\mu = 70$, $\sigma^2 = 100$ (so $\sigma = 10$). What fraction of students can score outside $[50, 90]$?

$$P(|X - 70| \ge 20) \le \frac{\sigma^2}{t^2} = \frac{100}{400} = 0.25.$$
At most 25% — regardless of the exact distribution of scores.
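Chebyshev's bound is distribution-free and therefore often loose. A small simulation (our own sketch; the Normal shape, seed, and sample size are arbitrary choices) illustrates the gap between the bound and a typical true value:

```python
import random

random.seed(1)
n = 100_000
# Simulated scores with mean 70, sd 10 (Normal chosen just for illustration)
outside = sum(
    1 for _ in range(n) if abs(random.gauss(70, 10) - 70) >= 20
) / n
print(outside)  # for the Normal this is ~0.046, well under Chebyshev's 0.25
```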
Cauchy–Schwarz inequality: for any two random variables $X, Y$ with finite second moments:

$$|E(XY)| \le \sqrt{E(X^2)\,E(Y^2)}.$$
Substituting $X - E(X)$ and $Y - E(Y)$ gives $|\text{Cov}(X,Y)| \le \sigma_X \sigma_Y$, which is why $|\rho| \le 1$.
Jensen's inequality: if $g$ is convex:

$$E[g(X)] \ge g(E(X)).$$
If $g$ is concave, the inequality reverses.
Since $g(x) = x^2$ is convex, $E(X^2) \ge [E(X)]^2$. This is equivalent to saying $\text{Var}(X) \ge 0$, a reassuring sanity check.
Since $\log$ is concave, $E(\log X) \le \log(E(X))$. This is used frequently in information theory and when working with likelihoods.
6. Convergence of Random Sequences
In statistics, we work with sequences of random variables (e.g., $\bar{X}_n$ as $n$ grows). We need precise notions of what it means for such a sequence to "approach" a limit.
6.1 Three Modes of Convergence
$X_n \xrightarrow{P} X$ if for every $\epsilon > 0$:

$$\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0.$$
As $n$ grows, the probability that $X_n$ is "far" from $X$ shrinks to zero. The random variable $X_n$ might occasionally be far away, but such events become increasingly rare.
$X_n \xrightarrow{d} X$ if:

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
for all $x$ where $F_X$ is continuous.
This is the weakest form. It doesn't say the values $X_n$ get close to the values of $X$; it says the shape of the distribution of $X_n$ looks more and more like the distribution of $X$. Think of histograms: the histogram of $X_n$ gradually morphs into the density curve of $X$.
$X_n \xrightarrow{L^r} X$ if $\lim_{n \to \infty} E(|X_n - X|^r) = 0$.
$r = 1$: convergence in mean. $r = 2$: convergence in quadratic mean.
6.2 Relationships Between Modes
In general, convergence in $L^r$ implies convergence in probability, which in turn implies convergence in distribution; the reverse implications fail in general. One useful exception: convergence in distribution to a constant $c$ is equivalent to convergence in probability to $c$. This special case is used frequently (e.g., in the Law of Large Numbers).
6.3 Continuous Mapping Theorem
If $g$ is a continuous function, then convergence is preserved:
- $X_n \xrightarrow{d} X \implies g(X_n) \xrightarrow{d} g(X)$
- $X_n \xrightarrow{P} X \implies g(X_n) \xrightarrow{P} g(X)$
If $X_n$ is getting close to $X$, and you apply a smooth (continuous) transformation, the result $g(X_n)$ stays close to $g(X)$. Continuous functions don't "tear apart" things that are close together.
6.4 Slutsky's Theorem
If $X_n \xrightarrow{d} X$ and $A_n \xrightarrow{P} a$ (a constant), then $A_n X_n \xrightarrow{d} aX$ and $X_n + A_n \xrightarrow{d} X + a$.
You can "multiply" a distributional limit by something that converges to a constant, and the result behaves as expected. This is extremely useful when you need to replace an unknown quantity (like $\sigma$) with a consistent estimator.
7. The Big Theorems: LLN and CLT
7.1 The (Weak) Law of Large Numbers
If $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$, then:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mu.$$
The sample average settles down to the true mean as you collect more data. This is why polls get more accurate with larger samples, why casinos always win in the long run, and why "practice makes perfect" has a statistical basis.
Flip a fair coin $n$ times. Let $X_i = 1$ if heads, 0 if tails. Then $\bar{X}_n$ is the proportion of heads. The WLLN says $\bar{X}_n \xrightarrow{P} 0.5$. After 10 flips you might get 0.3 or 0.7; after 10,000 flips, you'll be very close to 0.5.
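The coin-flip example is easy to simulate (our own sketch; the seed and checkpoints are arbitrary), watching the running proportion of heads settle down:

```python
import random

random.seed(0)
flips = [random.randint(0, 1) for _ in range(10_000)]

# Running proportion of heads at a few checkpoints
running = {}
total = 0
for i, f in enumerate(flips, start=1):
    total += f
    if i in (10, 100, 1_000, 10_000):
        running[i] = total / i
print(running)  # proportions drifting toward 0.5 as n grows
```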
7.2 The Central Limit Theorem (CLT)
If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2 > 0$, then:

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1).$$
The LLN says $\bar{X}_n$ converges to $\mu$. The CLT goes further and tells you how it converges: the fluctuations around $\mu$ are approximately normally distributed, with size roughly $\sigma / \sqrt{n}$.
This is remarkable because it doesn't matter what the original distribution looks like — it could be Bernoulli, Exponential, Poisson, or anything else with finite variance. The sum/average always becomes approximately Normal for large $n$. This universality is why the Normal distribution appears everywhere in statistics.
Roll a die $n$ times. We have $\mu = 3.5$, $\sigma^2 = 35/12 \approx 2.917$. By the CLT:

$$\bar{X}_n \approx N\!\left(3.5,\ \frac{2.917}{n}\right) \quad \text{for large } n.$$
For $n = 100$: $\bar{X}_{100} \approx N(3.5, \, 0.029)$, so $\sigma_{\bar{X}} \approx 0.17$. A 95% interval is roughly $3.5 \pm 0.34$.
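A simulation of the $n = 100$ case (our own sketch; seed and repetition count are arbitrary) recovers both the center and the $\sigma/\sqrt{n}$ spread:

```python
import random
import statistics

random.seed(42)
reps, n = 5_000, 100
# Sample means of 100 die rolls, repeated many times
means = [sum(random.randint(1, 6) for _ in range(n)) / n for _ in range(reps)]

print(round(statistics.mean(means), 2))   # close to mu = 3.5
print(round(statistics.stdev(means), 2))  # close to sigma/sqrt(n) ≈ 0.17
```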
If $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are i.i.d. random vectors in $\mathbb{R}^k$ with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, then:

$$\sqrt{n}\,(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \Sigma).$$
8. The Delta Method
Suppose $\sqrt{n}(Y_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\mu$ with $g'(\mu) \ne 0$. Then:

$$\sqrt{n}\,\big(g(Y_n) - g(\mu)\big) \xrightarrow{d} N\big(0,\ [g'(\mu)]^2 \sigma^2\big).$$
In the multivariate case, if $\sqrt{n}(\mathbf{Y}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \Sigma)$ and $g: \mathbb{R}^k \to \mathbb{R}$ is differentiable:

$$\sqrt{n}\,\big(g(\mathbf{Y}_n) - g(\boldsymbol{\mu})\big) \xrightarrow{d} N\big(0,\ \nabla g(\boldsymbol{\mu})^\top \Sigma\, \nabla g(\boldsymbol{\mu})\big).$$
You already know that $Y_n$ is approximately normal (via CLT). Now you want to know the distribution of $g(Y_n)$. The delta method says: zoom in near $\mu$ and approximate $g$ by its tangent line (first-order Taylor expansion). Since a linear transformation of a Normal is still Normal, $g(Y_n)$ is approximately Normal too — with variance scaled by the squared slope $[g'(\mu)]^2$.
Suppose $\bar{X}_n \xrightarrow{P} \mu$ with $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$, and we want the distribution of $g(\bar{X}_n) = \bar{X}_n^2$. Then $g'(\mu) = 2\mu$, so:

$$\sqrt{n}\,(\bar{X}_n^2 - \mu^2) \xrightarrow{d} N(0,\ 4\mu^2 \sigma^2).$$
If $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $\mu > 0$, with $g(x) = \log(x)$ and $g'(\mu) = 1/\mu$:

$$\sqrt{n}\,\big(\log \bar{X}_n - \log \mu\big) \xrightarrow{d} N\!\left(0,\ \frac{\sigma^2}{\mu^2}\right).$$
This is widely used to construct confidence intervals on a log-scale, which is natural for quantities like ratios or fold-changes.
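The log-transform prediction can be checked by simulation (our own sketch: Exponential(1) data, so $\mu = 1$ and $\sigma^2 = 1$; seed, $n$, and repetition count are arbitrary). The delta method predicts that $\log \bar{X}_n$ has standard deviation about $(\sigma/\mu)/\sqrt{n}$:

```python
import math
import random
import statistics

random.seed(7)
n, reps = 400, 4_000
# Exponential(1): mu = 1, sigma^2 = 1, so the delta method predicts
# sd(log Xbar_n) ≈ (sigma/mu)/sqrt(n) = 1/20 = 0.05
log_means = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    log_means.append(math.log(xbar))

print(round(statistics.stdev(log_means), 3))  # close to 0.05
```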
Write $g(Y_n) \approx g(\mu) + g'(\mu)(Y_n - \mu)$. Then $\sqrt{n}(g(Y_n) - g(\mu)) \approx g'(\mu) \cdot \sqrt{n}(Y_n - \mu)$. Since $\sqrt{n}(Y_n - \mu) \xrightarrow{d} N(0, \sigma^2)$, multiplying by the constant $g'(\mu)$ scales the variance by $[g'(\mu)]^2$. That's the entire proof idea.
Key Takeaways
- The quantile function $F^{-1}(q)$ inverts the CDF: it maps probabilities back to values. The median is $F^{-1}(1/2)$.
- Joint distributions describe multiple random variables together. Marginals are obtained by summing/integrating out other variables. Independence means the joint equals the product of marginals.
- Covariance measures linear association; $\text{Cov} = 0$ does not imply independence. The covariance matrix generalizes variance to random vectors, and transforms via $A\Sigma A^\top$.
- The tower rule ($E[E(X|Y)] = E(X)$) and law of total variance decompose expectations and variances into within-group and between-group components.
- The LLN says $\bar{X}_n \to \mu$; the CLT says the fluctuations are Normal with scale $\sigma/\sqrt{n}$; the delta method extends the CLT to smooth transformations $g(\bar{X}_n)$.