This lecture builds on the foundation of random variables and CDFs from Lecture 1. We begin with the quantile function — the "inverse" of the CDF — which lets us go from probabilities back to values. We then extend our view to multivariate distributions, covering joint, marginal, and conditional distributions, and the key concept of independence for random variables. Next we formalize moments (mean, variance, covariance) and their properties, including the powerful tools of conditional expectation and the law of total variance. We close with probability inequalities and the major convergence theorems (LLN, CLT, delta method) that underpin all of statistical inference.
1. The Quantile Function
Given a CDF $F$, the quantile function is defined as:

$$F^{-1}(q) = \inf\{x : F(x) \ge q\}, \qquad q \in (0, 1).$$
The value $F^{-1}(q)$ is called the $q$-quantile.
The CDF answers: "given a value $x$, what is the probability of being at most $x$?" The quantile function asks the reverse question: "given a probability $q$, what is the smallest value $x$ such that at least a fraction $q$ of the distribution lies at or below $x$?"
Think of it like a class exam. The CDF tells you "if you scored 75, you're in the 80th percentile." The quantile function tells you "to be in the 80th percentile, you need at least a score of 75."
When is the quantile just the usual inverse?
If $F$ is strictly increasing and continuous, then $F$ is a bijection (one-to-one and onto) from its support to $(0,1)$, and $F^{-1}(q)$ is simply the ordinary inverse function evaluated at $q$. The "inf" in the general definition handles cases where $F$ has flat regions or jumps (as with discrete distributions).
Let $X \sim \text{Uniform}(0,1)$, so $F(x) = x$ for $x \in [0,1]$. Then:

$$F^{-1}(q) = q, \qquad q \in (0, 1).$$
The 0.25-quantile is 0.25, the median ($q = 0.5$) is 0.5, and the 0.9-quantile is 0.9. This makes sense: for a uniform distribution, the quantiles are spread evenly across the interval.
Recall from Lecture 1 the $\text{Binomial}(3, 1/2)$ distribution with CDF:

$$F(x) = \begin{cases} 0 & x < 0 \\ 1/8 & 0 \le x < 1 \\ 1/2 & 1 \le x < 2 \\ 7/8 & 2 \le x < 3 \\ 1 & x \ge 3 \end{cases}$$
The median is $F^{-1}(1/2) = \inf\{x : F(x) \ge 1/2\}$. Looking at the CDF, $F(x)$ first reaches $1/2$ at $x = 1$. So the median is $\mathbf{1}$.
Similarly, $F^{-1}(0.9) = \inf\{x : F(x) \ge 0.9\} = 3$, since $F(2) = 7/8 = 0.875 < 0.9$ but $F(3) = 1 \ge 0.9$.
Let $X \sim \text{Exp}(\lambda)$ with $F(x) = 1 - e^{-\lambda x}$ for $x \ge 0$. Since $F$ is continuous and strictly increasing on $[0, \infty)$, we solve $F(x) = q$ directly:

$$F^{-1}(q) = -\frac{\ln(1 - q)}{\lambda}.$$
For $\lambda = 1$: the median is $-\ln(0.5) \approx 0.693$, and the 90th percentile is $-\ln(0.1) \approx 2.303$.
The median is the "middle value" of a distribution — half the probability mass lies below it, half above. For symmetric distributions (like the Normal), the median equals the mean. For skewed distributions, they can differ substantially.
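The inf-definition can be implemented directly by scanning a grid of candidate values. The following is a minimal Python sketch (function names are our own), checked against the Binomial(3, 1/2) and Exponential examples above:

```python
import math

def quantile(cdf, q, grid):
    """F^{-1}(q) = inf{x : F(x) >= q}, searched over an increasing grid."""
    for x in grid:
        if cdf(x) >= q:
            return x
    raise ValueError("q is never reached on this grid")

# Binomial(3, 1/2) CDF: F(0)=1/8, F(1)=1/2, F(2)=7/8, F(3)=1
def binom3_cdf(x):
    return sum(math.comb(3, k) for k in range(int(x) + 1)) / 8

print(quantile(binom3_cdf, 0.5, range(4)))  # median: 1
print(quantile(binom3_cdf, 0.9, range(4)))  # 0.9-quantile: 3

# Exponential(lam): closed-form inverse F^{-1}(q) = -ln(1-q)/lam
def exp_quantile(q, lam):
    return -math.log(1 - q) / lam

print(round(exp_quantile(0.5, 1.0), 3))  # median: 0.693
```

Note that for the discrete Binomial case the grid search is essential: no $x$ satisfies $F(x) = 0.9$ exactly, so the ordinary inverse does not exist and the infimum does the work.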
2. Multivariate Distributions
So far we've dealt with a single random variable $X$. In practice, we often observe multiple quantities simultaneously — for instance, a person's height and weight, or today's temperature and humidity. We need tools to describe their joint behavior.
$\mathbf{X} = (X_1, \ldots, X_k)^\top$ is a random vector in $\mathbb{R}^k$ if $X_1, \ldots, X_k$ are all real random variables defined on the same sample space $\Omega$.
2.1 Joint CDF, PMF, and PDF
The joint CDF of $\mathbf{X}$ is:

$$F_{\mathbf{X}}(\mathbf{x}) = P(X_1 \le x_1, \ldots, X_k \le x_k), \qquad \mathbf{x} = (x_1, \ldots, x_k)^\top \in \mathbb{R}^k.$$
Just as for a single random variable, we distinguish two important cases:
- Discrete: The joint pmf is $f_{\mathbf{X}}(\mathbf{x}) = P(X_1 = x_1, \ldots, X_k = x_k)$.
- Absolutely continuous: There exists a joint pdf $f_{\mathbf{X}}(\mathbf{x}) \ge 0$ such that $F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_k} \cdots \int_{-\infty}^{x_1} f_{\mathbf{X}}(t_1, \ldots, t_k)\,dt_1 \cdots dt_k$.
Roll two fair dice. Let $X_1$ = result of die 1 and $X_2$ = result of die 2. The joint pmf is:

$$f_{X_1, X_2}(x_1, x_2) = \frac{1}{36}, \qquad x_1, x_2 \in \{1, 2, \ldots, 6\}.$$
Every pair is equally likely. For example, $P(X_1 = 3, X_2 = 5) = 1/36$.
2.2 Marginal Distributions
The marginal distribution of a single component $X_i$ is obtained by "summing out" (discrete) or "integrating out" (continuous) all other variables from the joint distribution:

$$f_{X_i}(x_i) = \sum_{x_j,\ j \ne i} f_{\mathbf{X}}(\mathbf{x}) \quad \text{(discrete)}, \qquad f_{X_i}(x_i) = \int \cdots \int f_{\mathbf{X}}(\mathbf{x}) \prod_{j \ne i} dx_j \quad \text{(continuous)}.$$
Think of the joint distribution as a complete spreadsheet of data about every variable together. The marginal distribution is what you get when you focus on just one column and ignore the rest. The word "marginal" comes from the idea of writing row/column totals in the margins of a table.
Suppose $(X, Y)$ has joint pmf given by this table:
| | $Y=0$ | $Y=1$ | Marginal of $X$ |
|---|---|---|---|
| $X=0$ | 0.2 | 0.1 | 0.3 |
| $X=1$ | 0.3 | 0.4 | 0.7 |
| Marginal of $Y$ | 0.5 | 0.5 | 1.0 |
For instance, $f_X(1) = f(1,0) + f(1,1) = 0.3 + 0.4 = 0.7$. The marginals appear in the "margins" of the table — hence the name.
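Summing out the other variable is easy to do in code. Here is a short Python sketch, representing the joint pmf of Example 5 as a dict keyed by $(x, y)$ pairs (the representation is our choice, not part of the lecture):

```python
# Joint pmf of (X, Y) from the table above, keyed by (x, y)
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}

def marginal(joint, axis):
    """Sum out all coordinates except `axis` (0 for X, 1 for Y)."""
    m = {}
    for pair, p in joint.items():
        key = pair[axis]
        m[key] = m.get(key, 0.0) + p
    return m

fX = marginal(joint, 0)  # {0: 0.3, 1: 0.7}, up to float rounding
fY = marginal(joint, 1)  # {0: 0.5, 1: 0.5}
```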
2.3 Independence of Random Variables
Random variables $X_1, \ldots, X_k$ are mutually independent if for all events $A_1, \ldots, A_k$:

$$P(X_1 \in A_1, \ldots, X_k \in A_k) = \prod_{i=1}^{k} P(X_i \in A_i).$$

(Taking some $A_i = \mathbb{R}$ shows this covers every subset of the indices as well.)
$X_1, \ldots, X_k$ are mutually independent if and only if for all $\mathbf{x} \in \mathbb{R}^k$:

$$f_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^{k} f_{X_i}(x_i).$$
In words: the joint pmf/pdf factors into the product of the marginals.
Independence means knowing one variable tells you nothing about the others. If the joint distribution equals the product of the individual (marginal) distributions, then the variables carry no information about each other. This is the multivariate generalization of the event independence $P(A \cap B) = P(A)P(B)$ from Lecture 1.
Using Example 5: is $P(X=1, Y=0) = P(X=1) \cdot P(Y=0)$? We have $0.3 \ne 0.7 \times 0.5 = 0.35$. So $X$ and $Y$ are not independent.
If instead the table were $f(0,0)=0.15$, $f(0,1)=0.15$, $f(1,0)=0.35$, $f(1,1)=0.35$, then every cell would equal the product of its row and column marginals, and $X, Y$ would be independent.
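The cell-by-cell factorization check is mechanical, so it can be sketched in a few lines of Python (the helper name `is_independent` is ours). It distinguishes the dependent table of Example 5 from the independent variant just described:

```python
import math

# Example 5's table (dependent) and the modified table (independent)
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}
alt   = {(0, 0): 0.15, (0, 1): 0.15, (1, 0): 0.35, (1, 1): 0.35}

def is_independent(pmf):
    """Check f(x, y) == fX(x) * fY(y) for every cell of a 2-D pmf."""
    fX, fY = {}, {}
    for (x, y), p in pmf.items():
        fX[x] = fX.get(x, 0.0) + p
        fY[y] = fY.get(y, 0.0) + p
    return all(math.isclose(p, fX[x] * fY[y]) for (x, y), p in pmf.items())

print(is_independent(joint))  # False: 0.3 != 0.7 * 0.5
print(is_independent(alt))    # True: every cell factors
```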
2.4 Conditional Distributions
For a random pair $(X, Y)$, the conditional pmf/pdf of $X$ given $Y = y$ is:

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \qquad \text{wherever } f_Y(y) > 0.$$
This is just the definition of conditional probability applied to random variables. Once you observe that $Y = y$, you "zoom in" on the slice of the joint distribution where $Y = y$ and renormalize it so the probabilities add up to 1. The denominator $f_Y(y)$ does that renormalization.
From Example 5, what is $f_{X|Y}(x | Y=1)$?

$$f_{X|Y}(0 \mid 1) = \frac{0.1}{0.5} = 0.2, \qquad f_{X|Y}(1 \mid 1) = \frac{0.4}{0.5} = 0.8.$$
Given that $Y = 1$, there's an 80% chance that $X = 1$ and a 20% chance that $X = 0$. Notice the marginal was $P(X=1) = 0.7$, so observing $Y=1$ increased our belief that $X=1$ — they are positively associated.
3. Moments: Mean, Variance, Covariance
Moments are numerical summaries that capture key features of a distribution: its center, spread, and the relationships between variables.
3.1 Expectation (Mean)
The $k$-th moment of a random variable $X$ is:

$$E(X^k) = \sum_x x^k f_X(x) \quad \text{(discrete)}, \qquad E(X^k) = \int x^k f_X(x)\,dx \quad \text{(continuous)}.$$
The first moment $E(X)$ is the mean (or expected value) — the "center of mass" of the distribution.
For any constants $a_1, \ldots, a_p$ and random variables $X_1, \ldots, X_p$:

$$E\!\left(\sum_{i=1}^{p} a_i X_i\right) = \sum_{i=1}^{p} a_i\, E(X_i).$$
This holds always — no independence assumption needed. It's one of the most useful properties in probability.
Roll 100 dice. Let $S = X_1 + \cdots + X_{100}$. By linearity, $E(S) = 100 \cdot E(X_1) = 100 \cdot 3.5 = 350$. No need to enumerate all $6^{100}$ outcomes!
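A quick Monte Carlo sanity check (our own sketch, with an arbitrary seed and trial count) confirms linearity without any enumeration:

```python
import random

random.seed(0)
trials = 2000
# Average of S = X1 + ... + X100 over many simulated experiments
avg_S = sum(
    sum(random.randint(1, 6) for _ in range(100)) for _ in range(trials)
) / trials
print(round(avg_S, 1))  # close to E(S) = 100 * 3.5 = 350
```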
3.2 Variance and Standard Deviation
The variance of $X$ is the second central moment:

$$\text{Var}(X) = E\big[(X - E(X))^2\big] = E(X^2) - [E(X)]^2.$$
The standard deviation is $\sigma_X = \sqrt{\text{Var}(X)}$ and has the same units as $X$.
Variance measures how spread out a distribution is around its mean. A variance of zero means the random variable is actually a constant. The second formula $E(X^2) - [E(X)]^2$ is often easier to compute — it says: "the average of the squares minus the square of the average."
Let $X \sim \text{Bernoulli}(p)$, so $X \in \{0, 1\}$. Then $E(X) = p$, $E(X^2) = p$ (since $0^2=0$, $1^2=1$), and:

$$\text{Var}(X) = E(X^2) - [E(X)]^2 = p - p^2 = p(1-p).$$
Maximum variance occurs at $p = 1/2$ (most uncertainty), and variance is 0 at $p=0$ or $p=1$ (no randomness).
3.3 Covariance and Correlation
For two random variables $X$ and $Y$:

$$\text{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big] = E(XY) - E(X)E(Y), \qquad \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
Covariance tells you whether two variables tend to move together ($\text{Cov} > 0$), move in opposite directions ($\text{Cov} < 0$), or have no linear association ($\text{Cov} = 0$). Correlation is a normalized version on $[-1, 1]$ that is unit-free, making it easier to interpret.
Key properties to remember:
- $\text{Cov}(X, X) = \text{Var}(X)$.
- $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ — it's symmetric.
- If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$. (The converse is not true in general!)
- $\text{Var}\!\left(\sum a_i X_i\right) = \sum a_i^2 \text{Var}(X_i) + 2\sum_{i < j} a_i a_j \text{Cov}(X_i, X_j)$.
Let $X \sim \text{Uniform}\{-1, 0, 1\}$ and $Y = X^2$. Then $E(X) = 0$, so:

$$\text{Cov}(X, Y) = E(XY) - E(X)E(Y) = E(X^3) - 0 \cdot E(X^2) = 0,$$
since $X^3$ takes values $\{-1, 0, 1\}$ symmetrically. Yet $Y$ is completely determined by $X$! This shows zero covariance doesn't mean independence — it only captures linear relationships.
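The three expectations in this example are finite sums, so the claim can be verified by direct computation (a small Python sketch of our own):

```python
# X uniform on {-1, 0, 1}, Y = X^2: dependent but uncorrelated
support = [-1, 0, 1]
E_X  = sum(x for x in support) / 3          # 0
E_Y  = sum(x**2 for x in support) / 3       # 2/3
E_XY = sum(x * x**2 for x in support) / 3   # E(X^3) = 0
cov = E_XY - E_X * E_Y
print(cov)  # 0.0 -- yet Y is a deterministic function of X
```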
3.4 The Covariance Matrix and Linear Transformations
For $\mathbf{X} = (X_1, \ldots, X_k)^\top$:

$$\Sigma_{\mathbf{X}} = E\big[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top\big],$$
where $\boldsymbol{\mu} = E(\mathbf{X})$. The $(i,j)$-th entry of $\Sigma_{\mathbf{X}}$ is $\text{Cov}(X_i, X_j)$. Diagonal entries are the variances.
Let $\mathbf{Y} = A\mathbf{X}$ where $A$ is an $m \times k$ constant matrix. Then:

$$E(\mathbf{Y}) = A\,E(\mathbf{X}), \qquad \Sigma_{\mathbf{Y}} = A\,\Sigma_{\mathbf{X}} A^\top.$$
Let $S = X_1 + X_2$. This is $S = A\mathbf{X}$ with $A = (1 \;\; 1)$. Then:

$$\text{Var}(S) = A\,\Sigma_{\mathbf{X}} A^\top = \text{Var}(X_1) + \text{Var}(X_2) + 2\,\text{Cov}(X_1, X_2).$$
If $X_1, X_2$ are independent, $\text{Cov}=0$ and $\text{Var}(S) = \text{Var}(X_1) + \text{Var}(X_2)$.
4. Conditional Expectation and Variance
The conditional expectation of $X$ given $Y = y$ is the mean of the conditional distribution $f_{X|Y}(\cdot|y)$:

$$E(X \mid Y = y) = \sum_x x\, f_{X|Y}(x \mid y) \quad \text{(discrete)}, \qquad \int x\, f_{X|Y}(x \mid y)\,dx \quad \text{(continuous)}.$$
$E(X|Y)$ is a random variable — it takes value $E(X|Y=y)$ when $Y=y$.
Imagine you want to predict someone's income ($X$) and you know their education level ($Y$). $E(X|Y=y)$ is the best prediction of income for people with education level $y$ — it's the average income within that group. As $Y$ varies, this prediction changes, which is why $E(X|Y)$ is itself a random variable.
Using Example 5 (and Example 7), $E(X | Y = 1) = 0 \cdot 0.2 + 1 \cdot 0.8 = 0.8$, and $E(X | Y = 0) = 0 \cdot 0.4 + 1 \cdot 0.6 = 0.6$.
(Here $f_{X|Y}(0|0) = 0.2/0.5 = 0.4$ and $f_{X|Y}(1|0) = 0.3/0.5 = 0.6$.)
4.1 The Tower Rule (Law of Iterated Expectations)
$$E\big[E(X \mid Y)\big] = E(X).$$

Divide a population into groups based on $Y$. Compute the average of $X$ within each group. Then take the (weighted) average of those group averages. You get back the overall average of $X$. Think of it as: the average of the group averages is the grand average.
Continuing the previous example: $E(X|Y=0) = 0.6$ and $E(X|Y=1) = 0.8$. Since $P(Y=0) = 0.5$ and $P(Y=1) = 0.5$:

$$E(X) = 0.6 \times 0.5 + 0.8 \times 0.5 = 0.7,$$

which matches the marginal mean $E(X) = P(X = 1) = 0.7$.
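The same computation in Python, again using a dict for the joint pmf of Example 5 (the representation is our choice), shows $E(X|Y)$ as a function of $y$ and then averages it back to $E(X)$:

```python
# Joint pmf of (X, Y) from Example 5, and the marginal of Y
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}
fY = {0: 0.5, 1: 0.5}

# E(X | Y = y) = sum_x x * f(x, y) / fY(y)
E_X_given_Y = {
    y: sum(x * joint[(x, y)] for x in (0, 1)) / fY[y] for y in (0, 1)
}
print(E_X_given_Y)  # {0: 0.6, 1: 0.8}

# Tower rule: averaging the group means recovers E(X)
E_X = sum(E_X_given_Y[y] * fY[y] for y in (0, 1))
print(round(E_X, 2))  # 0.7, the marginal mean of X
```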
4.2 Law of Total Variance
$$\text{Var}(X) = E\big[\text{Var}(X \mid Y)\big] + \text{Var}\big(E(X \mid Y)\big).$$

The total variability in $X$ comes from two sources:
- $E[\text{Var}(X|Y)]$: the average variability within each group (how spread out incomes are among people with the same education level).
- $\text{Var}(E(X|Y))$: the variability between group means (how much the average income differs across education levels).
This is exactly the "within-group" + "between-group" variance decomposition familiar from ANOVA.
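The decomposition can be checked numerically on the table from Example 5 (a Python sketch of our own; `cond_mean_var` is a hypothetical helper name). Here $X$ is Bernoulli(0.7), so $\text{Var}(X) = 0.7 \times 0.3 = 0.21$:

```python
# Joint pmf of (X, Y) from Example 5
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.4}
fY = {0: 0.5, 1: 0.5}

def cond_mean_var(y):
    """Mean and variance of the conditional distribution of X given Y=y."""
    fx = {x: joint[(x, y)] / fY[y] for x in (0, 1)}
    m = sum(x * p for x, p in fx.items())
    v = sum((x - m) ** 2 * p for x, p in fx.items())
    return m, v

within  = sum(cond_mean_var(y)[1] * fY[y] for y in (0, 1))  # E[Var(X|Y)]
means   = {y: cond_mean_var(y)[0] for y in (0, 1)}
grand   = sum(means[y] * fY[y] for y in (0, 1))
between = sum((means[y] - grand) ** 2 * fY[y] for y in (0, 1))  # Var(E(X|Y))

print(within, between)  # 0.2 within-group + 0.01 between-group = 0.21
```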
5. Probability Inequalities
These inequalities are blunt but powerful tools: they give bounds on probabilities using only simple summaries (the mean, the variance). They are especially useful when you don't know the exact distribution.
Markov's inequality: if $X \ge 0$ and $t > 0$:

$$P(X \ge t) \le \frac{E(X)}{t}.$$
If a non-negative variable has a small mean, it can't be large too often. For instance, if the average commute time is 30 minutes, at most $30/120 = 25\%$ of commuters can have a commute $\ge 120$ minutes.
Chebyshev's inequality: if $X$ has mean $\mu$ and variance $\sigma^2$, then for any $t > 0$:

$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}.$$
Suppose exam scores have $\mu = 70$, $\sigma^2 = 100$ (so $\sigma = 10$). What fraction of students can score outside $[50, 90]$?

$$P(|X - 70| \ge 20) \le \frac{\sigma^2}{t^2} = \frac{100}{400} = 0.25.$$
At most 25% — regardless of the exact distribution of scores.
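Chebyshev's bound is distribution-free and therefore often loose. A small simulation (our own sketch; the Normal shape, seed, and sample size are arbitrary choices) illustrates the gap between the bound and a typical true value:

```python
import random

random.seed(1)
n = 100_000
# Simulated scores with mean 70, sd 10 (Normal chosen just for illustration)
outside = sum(
    1 for _ in range(n) if abs(random.gauss(70, 10) - 70) >= 20
) / n
print(outside)  # for the Normal this is ~0.046, well under Chebyshev's 0.25
```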
Cauchy–Schwarz inequality: for any two random variables $X, Y$ with finite second moments:

$$|E(XY)| \le \sqrt{E(X^2)\,E(Y^2)}.$$
Substituting $X - E(X)$ and $Y - E(Y)$ gives $|\text{Cov}(X,Y)| \le \sigma_X \sigma_Y$, which is why $|\rho| \le 1$.
Jensen's inequality: if $g$ is convex:

$$E[g(X)] \ge g(E(X)).$$
If $g$ is concave, the inequality reverses.
Since $g(x) = x^2$ is convex, $E(X^2) \ge [E(X)]^2$. This is equivalent to saying $\text{Var}(X) \ge 0$, a reassuring sanity check.
Since $\log$ is concave, $E(\log X) \le \log(E(X))$. This is used frequently in information theory and when working with likelihoods.
6. Convergence of Random Sequences
In statistics, we work with sequences of random variables (e.g., $\bar{X}_n$ as $n$ grows). We need precise notions of what it means for such a sequence to "approach" a limit.
6.1 Three Modes of Convergence
$X_n \xrightarrow{P} X$ if for every $\epsilon > 0$:

$$\lim_{n \to \infty} P(|X_n - X| > \epsilon) = 0.$$
As $n$ grows, the probability that $X_n$ is "far" from $X$ shrinks to zero. The random variable $X_n$ might occasionally be far away, but such events become increasingly rare.
$X_n \xrightarrow{d} X$ if:

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
for all $x$ where $F_X$ is continuous.
This is the weakest form. It doesn't say the values $X_n$ get close to the values of $X$; it says the shape of the distribution of $X_n$ looks more and more like the distribution of $X$. Think of histograms: the histogram of $X_n$ gradually morphs into the density curve of $X$.
$X_n \xrightarrow{L^r} X$ if $\lim_{n \to \infty} E(|X_n - X|^r) = 0$.
$r = 1$: convergence in mean. $r = 2$: convergence in quadratic mean.
6.2 Relationships Between Modes
In general, convergence in $L^r$ implies convergence in probability, which in turn implies convergence in distribution; the reverse implications fail in general. One useful exception: convergence in distribution to a constant $c$ is equivalent to convergence in probability to $c$. This special case is used frequently (e.g., in the Law of Large Numbers).
6.3 Continuous Mapping Theorem
If $g$ is a continuous function, then convergence is preserved:
- $X_n \xrightarrow{d} X \implies g(X_n) \xrightarrow{d} g(X)$
- $X_n \xrightarrow{P} X \implies g(X_n) \xrightarrow{P} g(X)$
If $X_n$ is getting close to $X$, and you apply a smooth (continuous) transformation, the result $g(X_n)$ stays close to $g(X)$. Continuous functions don't "tear apart" things that are close together.
6.4 Slutsky's Theorem
If $X_n \xrightarrow{d} X$ and $A_n \xrightarrow{P} a$ (a constant), then $A_n X_n \xrightarrow{d} aX$ and $X_n + A_n \xrightarrow{d} X + a$.
You can "multiply" a distributional limit by something that converges to a constant, and the result behaves as expected. This is extremely useful when you need to replace an unknown quantity (like $\sigma$) with a consistent estimator.
7. The Big Theorems: LLN and CLT
7.1 The (Weak) Law of Large Numbers
If $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$, then:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mu.$$
The sample average settles down to the true mean as you collect more data. This is why polls get more accurate with larger samples, why casinos always win in the long run, and why "practice makes perfect" has a statistical basis.
Flip a fair coin $n$ times. Let $X_i = 1$ if heads, 0 if tails. Then $\bar{X}_n$ is the proportion of heads. The WLLN says $\bar{X}_n \xrightarrow{P} 0.5$. After 10 flips you might get 0.3 or 0.7; after 10,000 flips, you'll be very close to 0.5.
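The coin-flip example is easy to simulate (our own sketch; the seed and checkpoints are arbitrary), watching the running proportion of heads settle down:

```python
import random

random.seed(0)
flips = [random.randint(0, 1) for _ in range(10_000)]

# Running proportion of heads at a few checkpoints
running = {}
total = 0
for i, f in enumerate(flips, start=1):
    total += f
    if i in (10, 100, 1_000, 10_000):
        running[i] = total / i
print(running)  # proportions drifting toward 0.5 as n grows
```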
7.2 The Central Limit Theorem (CLT)
If $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2 > 0$, then:

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1).$$
The LLN says $\bar{X}_n$ converges to $\mu$. The CLT goes further and tells you how it converges: the fluctuations around $\mu$ are approximately normally distributed, with size roughly $\sigma / \sqrt{n}$.
This is remarkable because it doesn't matter what the original distribution looks like — it could be Bernoulli, Exponential, Poisson, or anything else with finite variance. The sum/average always becomes approximately Normal for large $n$. This universality is why the Normal distribution appears everywhere in statistics.
Roll a die $n$ times. We have $\mu = 3.5$, $\sigma^2 = 35/12 \approx 2.917$. By the CLT:

$$\bar{X}_n \approx N\!\left(3.5,\ \frac{2.917}{n}\right) \quad \text{for large } n.$$
For $n = 100$: $\bar{X}_{100} \approx N(3.5, \, 0.029)$, so $\sigma_{\bar{X}} \approx 0.17$. A 95% interval is roughly $3.5 \pm 0.34$.
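A simulation of the $n = 100$ case (our own sketch; seed and repetition count are arbitrary) recovers both the center and the $\sigma/\sqrt{n}$ spread:

```python
import random
import statistics

random.seed(42)
reps, n = 5_000, 100
# Sample means of 100 die rolls, repeated many times
means = [sum(random.randint(1, 6) for _ in range(n)) / n for _ in range(reps)]

print(round(statistics.mean(means), 2))   # close to mu = 3.5
print(round(statistics.stdev(means), 2))  # close to sigma/sqrt(n) ≈ 0.17
```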
If $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are i.i.d. random vectors in $\mathbb{R}^k$ with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, then:

$$\sqrt{n}\,(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \Sigma).$$
8. The Delta Method
Suppose $\sqrt{n}(Y_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $g$ is differentiable at $\mu$ with $g'(\mu) \ne 0$. Then:

$$\sqrt{n}\,\big(g(Y_n) - g(\mu)\big) \xrightarrow{d} N\big(0,\ [g'(\mu)]^2 \sigma^2\big).$$
In the multivariate case, if $\sqrt{n}(\mathbf{Y}_n - \boldsymbol{\mu}) \xrightarrow{d} N(\mathbf{0}, \Sigma)$ and $g: \mathbb{R}^k \to \mathbb{R}$ is differentiable:

$$\sqrt{n}\,\big(g(\mathbf{Y}_n) - g(\boldsymbol{\mu})\big) \xrightarrow{d} N\big(0,\ \nabla g(\boldsymbol{\mu})^\top \Sigma\, \nabla g(\boldsymbol{\mu})\big).$$
You already know that $Y_n$ is approximately normal (via CLT). Now you want to know the distribution of $g(Y_n)$. The delta method says: zoom in near $\mu$ and approximate $g$ by its tangent line (first-order Taylor expansion). Since a linear transformation of a Normal is still Normal, $g(Y_n)$ is approximately Normal too — with variance scaled by the squared slope $[g'(\mu)]^2$.
Suppose $\bar{X}_n \xrightarrow{P} \mu$ with $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$, and we want the distribution of $g(\bar{X}_n) = \bar{X}_n^2$. Then $g'(\mu) = 2\mu$, so:

$$\sqrt{n}\,(\bar{X}_n^2 - \mu^2) \xrightarrow{d} N(0,\ 4\mu^2 \sigma^2).$$
If $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $\mu > 0$, with $g(x) = \log(x)$ and $g'(\mu) = 1/\mu$:

$$\sqrt{n}\,\big(\log \bar{X}_n - \log \mu\big) \xrightarrow{d} N\!\left(0,\ \frac{\sigma^2}{\mu^2}\right).$$
This is widely used to construct confidence intervals on a log-scale, which is natural for quantities like ratios or fold-changes.
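The log-transform prediction can be checked by simulation (our own sketch: Exponential(1) data, so $\mu = 1$ and $\sigma^2 = 1$; seed, $n$, and repetition count are arbitrary). The delta method predicts that $\log \bar{X}_n$ has standard deviation about $(\sigma/\mu)/\sqrt{n}$:

```python
import math
import random
import statistics

random.seed(7)
n, reps = 400, 4_000
# Exponential(1): mu = 1, sigma^2 = 1, so the delta method predicts
# sd(log Xbar_n) ≈ (sigma/mu)/sqrt(n) = 1/20 = 0.05
log_means = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    log_means.append(math.log(xbar))

print(round(statistics.stdev(log_means), 3))  # close to 0.05
```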
Write $g(Y_n) \approx g(\mu) + g'(\mu)(Y_n - \mu)$. Then $\sqrt{n}(g(Y_n) - g(\mu)) \approx g'(\mu) \cdot \sqrt{n}(Y_n - \mu)$. Since $\sqrt{n}(Y_n - \mu) \xrightarrow{d} N(0, \sigma^2)$, multiplying by the constant $g'(\mu)$ scales the variance by $[g'(\mu)]^2$. That's the entire proof idea.
Key Takeaways
- The quantile function $F^{-1}(q)$ inverts the CDF: it maps probabilities back to values. The median is $F^{-1}(1/2)$.
- Joint distributions describe multiple random variables together. Marginals are obtained by summing/integrating out other variables. Independence means the joint equals the product of marginals.
- Covariance measures linear association; $\text{Cov} = 0$ does not imply independence. The covariance matrix generalizes variance to random vectors, and transforms via $A\Sigma A^\top$.
- The tower rule ($E[E(X|Y)] = E(X)$) and law of total variance decompose expectations and variances into within-group and between-group components.
- The LLN says $\bar{X}_n \to \mu$; the CLT says the fluctuations are Normal with scale $\sigma/\sqrt{n}$; the delta method extends the CLT to smooth transformations $g(\bar{X}_n)$.