This chapter introduces one of the most powerful ideas in optimization: duality. The core question is: given an optimization problem, can we construct another problem whose optimal value gives a bound on (or equals) the original? The answer is yes, and the machinery we build — the Lagrangian, dual functions, weak and strong duality, KKT conditions, and Slater's condition — forms the theoretical backbone for understanding, solving, and reasoning about convex programs.
3.1 Lagrange Dual Function and Dual Problem
The Big Idea: From Constraints to Penalties
Consider a general mathematical program with inequality and equality constraints:

$$\min_{x} \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0 \;\; (i \in [m]), \qquad h_j(x) = 0 \;\; (j \in [\ell]),$$

where $f: F \to \mathbb{R}$, $g_i: G_i \to \mathbb{R}$, and $h_j: H_j \to \mathbb{R}$.
The domain of the program is $P := F \cap \bigcap_{i=1}^m G_i \cap \bigcap_{j=1}^\ell H_j$ and the optimal value is $f^*$.
A key observation is that we can absorb the constraints into the objective using indicator functions:

$$I_{-}(u) := \begin{cases} 0 & \text{if } u \le 0, \\ +\infty & \text{if } u > 0, \end{cases} \qquad I_{0}(u) := \begin{cases} 0 & \text{if } u = 0, \\ +\infty & \text{if } u \ne 0. \end{cases}$$

This lets us rewrite the constrained problem as the unconstrained equivalent:

$$\min_{x \in P} \; f(x) + \sum_{i=1}^m I_{-}(g_i(x)) + \sum_{j=1}^\ell I_{0}(h_j(x)).$$
If $x$ satisfies all constraints, the indicators contribute zero. If it violates any constraint, the penalty is $+\infty$, so the point is effectively excluded. The problem is equivalent but the objective is nasty (non-convex, non-differentiable). The trick: replace these infinite-jump indicators with linear lower bounds.
For any $\lambda_i \ge 0$ and any $\nu_j \in \mathbb{R}$, we have the linear lower bounds:

$$\lambda_i u \le I_{-}(u) \qquad \text{and} \qquad \nu_j u \le I_{0}(u) \qquad \text{for all } u \in \mathbb{R}.$$
Replacing each indicator with its linear lower bound gives us something smaller than the true objective — in other words, a lower bound on $f^*$.
The Lagrangian $L: P \times \mathbb{R}^m \times \mathbb{R}^\ell \to \mathbb{R}$ is defined as:

$$L(x, \lambda, \nu) := f(x) + \sum_{i=1}^m \lambda_i g_i(x) + \sum_{j=1}^\ell \nu_j h_j(x).$$
The vectors $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^\ell$ are called Lagrange multipliers.
The Lagrange dual function $\hat{L}: \mathbb{R}^m \times \mathbb{R}^\ell \to \mathbb{R} \cup \{-\infty\}$ is:

$$\hat{L}(\lambda, \nu) := \inf_{x \in P} L(x, \lambda, \nu).$$
Think of it this way: for a fixed choice of multipliers $(\lambda, \nu)$, the dual function asks "what is the best (smallest) value the Lagrangian can take?" This infimum over $x$ eliminates the primal variable and gives us a function purely in terms of $(\lambda, \nu)$.
$\hat{L}(\lambda, \nu)$ is concave in $(\lambda, \nu)$, even if the original problem is not convex. This follows because $\hat{L}$ is a pointwise infimum of functions that are affine in $(\lambda, \nu)$, and the infimum of affine (hence concave) functions is concave.
The dual feasible region is $D := \{(\lambda, \nu) \in \mathbb{R}^m_{\ge 0} \times \mathbb{R}^\ell \mid \hat{L}(\lambda, \nu) > -\infty\}$. The Lagrange dual program is:

$$\max_{(\lambda, \nu)} \; \hat{L}(\lambda, \nu) \quad \text{s.t.} \quad (\lambda, \nu) \in D,$$

and its optimal value is denoted $\hat{f}^*$.
Example: Dual of a Linear Program
Consider the standard-form LP: $\min c^\top x$ subject to $Ax = b$, $x \ge 0$. With multipliers $\lambda \ge 0$ (for $-x \le 0$) and $\nu$ (for $Ax = b$), the Lagrangian is:

$$L(x, \lambda, \nu) = c^\top x - \lambda^\top x + \nu^\top (Ax - b) = (c - \lambda + A^\top \nu)^\top x - b^\top \nu.$$
Since this is linear in $x$, the infimum is $-\infty$ unless $c + A^\top \nu - \lambda = 0$, giving the dual:

$$\max_{\nu} \; -b^\top \nu \quad \text{s.t.} \quad A^\top \nu + c \ge 0,$$

where the multiplier $\lambda = c + A^\top \nu$ has been eliminated using $\lambda \ge 0$,
which is exactly the classical LP dual (after flipping sign conventions).
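As a quick numerical sanity check, the following sketch solves the primal and the dual $\max -b^\top\nu$ s.t. $A^\top\nu + c \ge 0$ with `scipy.optimize.linprog`, on a tiny LP chosen purely for illustration, and watches the two optimal values coincide:

```python
import numpy as np
from scipy.optimize import linprog

# Small standard-form LP: min c^T x  s.t.  Ax = b, x >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Primal problem.
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 2)

# Dual: max -b^T nu  s.t.  A^T nu + c >= 0  (nu free), written for
# linprog as: min b^T nu  s.t.  -A^T nu <= c.
dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=[(None, None)])

print(primal.fun, -dual.fun)  # both 1.0 (up to solver tolerance)
```

Both calls return optimal value $1$: strong duality for LPs, which Section 3.5 explains.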
Even cosmetic changes to the primal (adding a redundant variable, composing the objective with an increasing function) can change the dual problem. For instance, the unconstrained problem $\min f(Ax+b)$ has a trivial dual (just $f^*$), but rewriting it as $\min f(y)$ s.t. $Ax+b = y$ produces the more informative dual $\max\; b^\top\nu - f^*(\nu)$ subject to $A^\top \nu = 0$, where $f^*(\nu) := \sup_y (\nu^\top y - f(y))$ is the conjugate function (here $f^*$ denotes the conjugate, not the optimal value).
3.2 Weak Duality
For every $(\lambda, \nu) \in D$:

$$\hat{L}(\lambda, \nu) \le f^*.$$
In particular, $\hat{f}^* \le f^*$.
The proof is elegant. Let $\tilde{x}$ be any primal feasible point. Since $g_i(\tilde{x}) \le 0$ and $\lambda_i \ge 0$, and $h_j(\tilde{x}) = 0$, the penalty terms $\sum \lambda_i g_i(\tilde{x}) + \sum \nu_j h_j(\tilde{x})$ are non-positive. Therefore:

$$\hat{L}(\lambda, \nu) = \inf_{x \in P} L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f(\tilde{x}).$$
Since this holds for every feasible $\tilde{x}$, it holds for the best one, so $\hat{L}(\lambda,\nu) \le f^*$.
Even for non-convex problems, the dual is always a convex program (since $\hat{L}$ is concave). This means we can always compute a lower bound on $f^*$ by solving a convex problem. This is extremely valuable for hard combinatorial problems.
Consider the non-convex problem $\min x^\top W x$ subject to $x_i^2 = 1$ for all $i$ (so each $x_i \in \{-1, +1\}$). This is NP-hard for large $n$. But the Lagrangian gives:

$$L(x, \nu) = x^\top W x + \sum_{i=1}^n \nu_i (x_i^2 - 1) = x^\top \left( W + \text{diag}(\nu) \right) x - \mathbf{1}^\top \nu.$$
The infimum of this quadratic form is $0$ when $W + \text{diag}(\nu) \succeq 0$ and $-\infty$ otherwise, yielding the convex dual:

$$\max_{\nu} \; -\mathbf{1}^\top \nu \quad \text{s.t.} \quad W + \text{diag}(\nu) \succeq 0.$$
A simple feasible point is $\nu = -\lambda_{\min}(W) \cdot \mathbf{1}$, giving the lower bound $n \cdot \lambda_{\min}(W)$.
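To see the bound in action, the following sketch (with an arbitrarily chosen small $W$; the brute-force loop is only viable for tiny $n$) compares the dual bound $n \cdot \lambda_{\min}(W)$ against the true combinatorial optimum:

```python
import itertools
import numpy as np

# An arbitrarily chosen small instance (n = 3).
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
n = W.shape[0]

# Dual lower bound from the feasible point nu = -lambda_min(W) * 1.
bound = n * np.linalg.eigvalsh(W)[0]

# Exact optimum by brute force over all 2^n sign vectors.
best = min(np.array(s) @ W @ np.array(s)
           for s in itertools.product([-1.0, 1.0], repeat=n))

print(bound, best)  # approx -3.0 and -2.0: a valid (here non-tight) bound
```

The gap between $-3$ and $-2$ shows weak duality at work: the convex dual certifies a lower bound without solving the NP-hard problem, but the bound need not be tight.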
3.3 Strong Duality
The duality gap is the quantity $f^* - \hat{f}^* \ge 0$. We say strong duality holds when this gap is zero, i.e., $f^* = \hat{f}^*$.
Strong duality does not always hold (especially for non-convex problems), but when it does, we get a powerful tool: solving the dual is equivalent to solving the primal. Strong duality enables complementary slackness, the KKT conditions, and is the foundation for interior-point methods.
Complementary Slackness
Let $x^*$ and $(\lambda^*, \nu^*)$ be optimal primal and dual solutions with $f^* = \hat{f}^*$. Then for every $i \in [m]$:

$$\lambda_i^* \, g_i(x^*) = 0.$$
In words: at optimality, for each constraint, either the multiplier is zero (the constraint "doesn't matter") or the constraint is active/tight (the constraint is binding). You can never have both a positive multiplier and slack in the constraint simultaneously.
Think of $\lambda_i$ as the "price" of relaxing constraint $i$. If constraint $i$ has slack ($g_i(x^*) < 0$), then relaxing it wouldn't help — it's not a bottleneck — so its price $\lambda_i^*$ is zero. If $\lambda_i^* > 0$, the constraint must be tight: it's actively limiting the objective, so relaxing it would improve the optimum.
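A minimal sketch of this logic on a hypothetical one-variable problem chosen for illustration (the optimum and multipliers here are found by hand, not by a solver):

```python
# Illustrative problem: min (x - 2)^2
#   s.t.  g1(x) = x - 1 <= 0   (active at the optimum)
#         g2(x) = -x    <= 0   (slack at the optimum)
x_star = 1.0          # optimum, found by inspection
lam = [2.0, 0.0]      # from stationarity 2(x* - 2) + lam1 - lam2 = 0, lam2 = 0

g = [x_star - 1.0, -x_star]   # constraint values at x*: [0.0, -1.0]

# Complementary slackness: lambda_i * g_i(x*) = 0 for every i.
for lam_i, g_i in zip(lam, g):
    assert lam_i * g_i == 0.0

print("active constraint: positive price; slack constraint: price zero")
```

The binding constraint $x \le 1$ carries price $\lambda_1^* = 2$ (relaxing it would improve the optimum), while the slack constraint $x \ge 0$ has price zero.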
3.4 KKT Conditions
The Karush–Kuhn–Tucker (KKT) conditions capture what must hold at a pair of optimal primal and dual solutions when strong duality holds and the functions are differentiable. They unify feasibility, complementary slackness, and the gradient condition into a single system.
A point $(\tilde{x}, \tilde{\lambda}, \tilde{\nu})$ is a KKT-point if it satisfies:
- Primal feasibility: $g_i(\tilde{x}) \le 0$ for all $i$, and $h_j(\tilde{x}) = 0$ for all $j$.
- Dual feasibility: $\tilde{\lambda}_i \ge 0$ for all $i$.
- Complementary slackness: $\tilde{\lambda}_i g_i(\tilde{x}) = 0$ for all $i$.
- Stationarity: $\nabla f(\tilde{x}) + \sum_{i=1}^m \tilde{\lambda}_i \nabla g_i(\tilde{x}) + \sum_{j=1}^\ell \tilde{\nu}_j \nabla h_j(\tilde{x}) = 0.$
The stationarity condition says that at the optimum, the direction of steepest descent $-\nabla f(\tilde{x})$ is a non-negative combination of the gradients of the active inequality constraints plus an arbitrary linear combination of the equality constraint gradients. Geometrically: any direction that would decrease the objective is blocked by the active constraints.
For any mathematical program with differentiable $f$, $g_i$, $h_j$ in which strong duality holds, any pair $x^*, (\lambda^*, \nu^*)$ of primal and dual optimal solutions forms a KKT-point.
For any convex program with differentiable $f$, $g_i$, $h_j$: if $(\tilde{x}, \tilde{\lambda}, \tilde{\nu})$ is a KKT-point, then $\tilde{x}$ and $(\tilde{\lambda}, \tilde{\nu})$ are primal and dual optimal, and strong duality holds.
For convex programs, the KKT conditions are both necessary (when strong duality holds) and sufficient for optimality. Most practical algorithms for solving convex problems (e.g., interior-point methods) are essentially methods for finding a KKT-point.
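For the equality-constrained case, finding a KKT-point literally means solving a linear system. A sketch, assuming the toy QP $\min \frac{1}{2}\|x\|^2$ s.t. $Ax = b$ with an arbitrarily chosen $A$ and $b$:

```python
import numpy as np

# Toy QP: min (1/2) ||x||^2  s.t.  Ax = b.
# Stationarity (x + A^T nu = 0) plus primal feasibility (Ax = b) is
# one linear system in (x, nu): the KKT system.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
n, p = A.shape[1], A.shape[0]

K = np.block([[np.eye(n), A.T],
              [A, np.zeros((p, p))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), b]))
x, nu = sol[:n], sol[n:]

assert np.allclose(x + A.T @ nu, 0)   # stationarity
assert np.allclose(A @ x, b)          # primal feasibility
print(x)                              # approx [1/3, 2/3, 1/3]
```

With no inequality constraints, complementary slackness and dual feasibility are vacuous, so the KKT conditions reduce to exactly this block linear system.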
Consider $\min x$ subject to $x^2 \le 0$. The only feasible point is $x = 0$ with value $0$. The dual gives $\hat{L}(\lambda) = -1/(4\lambda)$ for $\lambda > 0$, approaching $0$ as $\lambda \to \infty$ but never reaching it. So strong duality holds ($f^* = \hat{f}^* = 0$) but the dual optimum is never attained, and there is no KKT-point. This happens because Slater's condition fails (you can't have $x^2 < 0$).
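We can watch this happen numerically: a sketch that minimizes the Lagrangian $x + \lambda x^2$ for growing $\lambda$ and checks the result against the closed form $-1/(4\lambda)$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# For min x s.t. x^2 <= 0, the dual function is
#   L_hat(lam) = inf_x (x + lam * x^2) = -1 / (4 * lam)   for lam > 0.
for lam in [1.0, 10.0, 100.0, 1000.0]:
    res = minimize_scalar(lambda x: x + lam * x ** 2)
    assert np.isclose(res.fun, -1.0 / (4.0 * lam))
    print(lam, res.fun)   # climbs toward 0 but never attains it
```

The printed dual values increase toward $\hat{f}^* = 0$ but remain strictly negative for every finite $\lambda$, illustrating strong duality without dual attainment.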
3.5 Strong Duality — Slater's Condition
Slater's condition is the most widely used constraint qualification — a sufficient condition guaranteeing that strong duality holds for convex programs.
For any convex program, if there exists a feasible point $\bar{x} \in \text{relint}(P)$ such that $g_i(\bar{x}) < 0$ for all non-affine $g_i$ (affine $g_i$ need only satisfy $g_i(\bar{x}) \le 0$) and $h_j(\bar{x}) = 0$ for all $j$, then:
- Strong duality holds: $f^* = \hat{f}^*$.
- The dual optimum is attained: there exist $\lambda^*, \nu^*$ achieving $\hat{f}^*$.
In plain terms: there must exist at least one "strictly feasible" point — a point that is in the interior of the feasible region (not just on the boundary). For affine constraints, being on the boundary is fine; only the non-linear inequality constraints need to be strictly satisfied.
Two important consequences deserve highlighting. First, since all constraints of an LP are affine, the refined Slater condition demands nothing beyond feasibility — so strong duality holds for every feasible linear program. Second, Slater's condition is sufficient but not necessary: the example $\min x$ s.t. $x^2 \le 0$ has strong duality but Slater's condition fails.
The second part of Slater's theorem — that the dual optimum is attained — is equally important in practice. It guarantees that we can actually find dual multipliers achieving the bound, which is needed for algorithms based on KKT conditions.
3.6 Geometric Interpretation
This section provides a beautiful visual way to understand weak duality, strong duality, and Slater's condition using a single geometric construction.
The Set $G$
For a mathematical program with one inequality constraint ($m = 1$, $\ell = 0$ for simplicity), define the set of achievable constraint-value and objective-value pairs:

$$G := \{\, (g_1(x), f(x)) \mid x \in P \,\} \subseteq \mathbb{R}^2.$$
In the $(u, t)$-plane, the primal optimum $f^*$ is the lowest point on the $t$-axis that the set $G$ achieves when $u \le 0$:

$$f^* = \inf \{\, t \mid (u, t) \in G, \; u \le 0 \,\}.$$
Dual Values as Supporting Hyperplanes
For a given $\lambda \ge 0$, the dual value $\hat{L}(\lambda)$ equals the minimum of $\lambda u + t$ over $G$. Geometrically, the line $\lambda u + t = \hat{L}(\lambda)$ is a non-vertical supporting hyperplane of $G$. Its intersection with the $t$-axis (at $u = 0$) gives $\hat{L}(\lambda)$.
Picture the set $G$ in the $(u, t)$-plane. The primal value $f^*$ is found by looking at where $G$ meets (or approaches) the vertical line $u = 0$ from the left. The dual value $\hat{L}(\lambda)$ is where the supporting line with slope $-\lambda$ meets the $t$-axis. The duality gap is the vertical distance between $f^*$ and $\hat{f}^*$ on the $t$-axis.
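This picture can be sampled numerically. A sketch on the illustrative problem $\min x$ s.t. $x^2 - 1 \le 0$ (so $f^* = -1$ at $x = -1$), approximating $G$ on a grid:

```python
import numpy as np

# Illustrative problem: f(x) = x,  g(x) = x^2 - 1 <= 0.
xs = np.linspace(-3.0, 3.0, 2001)
u, t = xs ** 2 - 1.0, xs          # sampled points of G = {(g(x), f(x))}

f_star = t[u <= 0].min()          # lowest point of G on or left of u = 0

# The line lam*u + t = L_hat(lam) supports G; its t-axis intercept is L_hat.
for lam in [0.25, 0.5, 1.0]:
    L_hat = (lam * u + t).min()
    assert L_hat <= f_star + 1e-9     # weak duality, read off the picture

# At lam = 0.5 the supporting line passes through (0, f*): strong duality.
print(f_star, (0.5 * u + t).min())
```

Here $\hat{L}(\lambda) = -1/(4\lambda) - \lambda$, maximized at $\lambda = 1/2$ with value $-1 = f^*$: the supporting line of slope $-1/2$ touches $G$ exactly at height $f^*$ on the $t$-axis.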
The Epigraph Set $A$
Extending $G$ by allowing worse constraint and objective values gives:

$$A := \{\, (u, t) \mid \exists x \in P : \; g_1(x) \le u, \; f(x) \le t \,\} = G + \mathbb{R}^2_{\ge 0}.$$
This set is always convex for convex programs (since $g_1$ and $f$ are convex). Now $f^* = \inf\{t : (0, t) \in A\}$ and the dual values are still given by supporting hyperplanes of $A$.
Strong duality holds if and only if $A$ has a non-vertical supporting hyperplane at the point $(0, f^*)$. Slater's condition guarantees this by ensuring that $A$ contains points with $u < 0$, which rules out a vertical supporting hyperplane at $(0, f^*)$.
This geometric picture provides an elegant alternative proof of weak duality. Since $(0, f^*) \in A$, for any supporting hyperplane $(\lambda, 1)^\top(u, t) \ge \hat{L}(\lambda)$ we get $\hat{L}(\lambda) \le ({\lambda}, 1)^\top(0, f^*) = f^*$.
3.7 Theorems of Alternatives
Duality theory is not only about optimization — it can also prove that systems of inequalities are infeasible. The idea: if a system has no solution, there must be a "certificate of infeasibility" in the form of dual multipliers.
Weak Alternatives
Two systems are weak alternatives if at most one of them is feasible.
Consider the primal system $g_i(x) \le 0$, $h_j(x) = 0$ versus the dual system $\hat{L}(\lambda, \nu) > 0$, $\lambda \ge 0$. By weak duality, at most one of these can be feasible: if the primal is feasible then $f^* = 0$ (the objective is zero) so $\hat{f}^* \le 0$, ruling out $\hat{L} > 0$.
Strong Alternatives
Two systems are strong alternatives if exactly one of them is feasible.
When the constraints $g_i$ are convex and $h_j$ are affine, and a mild regularity condition holds (a point in $\text{relint}(P)$ satisfying the equalities exists), we upgrade from weak to strong alternatives. For instance, for strict inequalities:
Under a regularity condition, exactly one of the following two systems is feasible:

$$\text{(I)} \quad g_i(x) < 0 \;\; (i \in [m]), \;\; h_j(x) = 0 \;\; (j \in [\ell]); \qquad \text{(II)} \quad \lambda \ge 0, \; \lambda \ne 0, \; \hat{L}(\lambda, \nu) \ge 0.$$
Either you can strictly satisfy all constraints (the system is "really" feasible), or there exist dual multipliers certifying that the constraints are incompatible. There's no middle ground — this is the convex analogue of Farkas' Lemma from linear programming.
3.8 Applications — The Dual SVM
One of the most impactful applications of duality is the dual formulation of the Support Vector Machine (SVM). Recall from Chapter 2 that the (primal) SVM seeks the maximum-margin separating hyperplane:

$$\min_{w, b} \; \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 \quad (i \in [m]).$$
Deriving the Dual
Introducing multipliers $\lambda_i \ge 0$ for each constraint and forming the Lagrangian:

$$L(w, b, \lambda) = \|w\|^2 + \sum_{i=1}^m \lambda_i \left( 1 - y_i (w^\top x_i + b) \right).$$
Setting partial derivatives to zero gives two conditions: $w = \frac{1}{2}\sum_i \lambda_i y_i x_i$ and $\sum_i \lambda_i y_i = 0$. Substituting back yields the dual:

$$\max_{\lambda \ge 0} \; \sum_{i=1}^m \lambda_i - \frac{1}{4} \sum_{i=1}^m \sum_{j=1}^m \lambda_i \lambda_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \sum_{i=1}^m \lambda_i y_i = 0.$$
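As a sanity check, this dual can be solved directly on a hypothetical two-point dataset (the data and the use of SLSQP are illustration choices, not part of the derivation):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-point dataset, one point per class.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Negated dual objective: sum_i lam_i - (1/4) lam^T Q lam.
    return -(lam.sum() - 0.25 * lam @ Q @ lam)

res = minimize(neg_dual, x0=np.ones(2), method="SLSQP",
               bounds=[(0, None)] * 2,
               constraints={"type": "eq", "fun": lambda lam: lam @ y})

lam = res.x
w = 0.5 * (lam * y) @ X          # recover w = (1/2) sum_i lam_i y_i x_i
print(lam, w)                    # lam ~ [1, 1], w ~ [1, 0]
```

The recovered hyperplane $w = (1, 0)$, $b = 0$ separates the two points with maximum margin, and the dual optimal value equals the primal value $\|w\|^2 = 1$: strong duality (Slater's condition holds here).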
Why the Dual Formulation is Powerful
The dual has two major advantages over the primal:
- Dimension independence: The primal has $n + 1$ variables (dimension of feature space), while the dual has $m$ variables (number of data points). In high-dimensional settings like image classification where $n \gg m$, the dual is much smaller.
- The kernel trick: The dual and the classification rule depend only on inner products $x_i^\top x_j$ and $x_i^\top z$. This means we can replace these with a kernel function $k(x, y) = \Phi(x) \cdot \Phi(y)$ for some feature map $\Phi$ without ever computing $\Phi$ explicitly. For example, the Gaussian kernel $k(x, y) = e^{-\|x-y\|^2 / 2\sigma^2}$ implicitly maps into an infinite-dimensional space.
Consider the feature map $\Phi: (x_1, x_2)^\top \mapsto (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^\top$. Its kernel is simply $k(x, y) = (x^\top y)^2$. So instead of working in 3D, we stay in 2D and use this kernel in the dual — the math handles the "lifting" for us automatically.
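The claimed identity is easy to verify numerically — a sketch checking $\Phi(x)^\top \Phi(y) = (x^\top y)^2$ on random inputs:

```python
import numpy as np

def phi(x):
    # Explicit feature map (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def k(x, y):
    # Claimed equivalent kernel, evaluated entirely in 2D.
    return (x @ y) ** 2

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    assert np.isclose(phi(x) @ phi(y), k(x, y))
print("phi(x) . phi(y) == (x . y)^2 on all samples")
```

Expanding $\Phi(x)^\top \Phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (x_1 y_1 + x_2 y_2)^2$ confirms the identity algebraically as well.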
Key Takeaways
- The Lagrangian replaces hard indicator-function penalties with soft linear penalties parametrized by multipliers $\lambda, \nu$. The dual function $\hat{L}(\lambda, \nu) = \inf_x L(x, \lambda, \nu)$ is always concave.
- Weak duality ($\hat{f}^* \le f^*$) always holds, giving a lower bound on the primal. This makes the dual valuable even for non-convex problems.
- Strong duality ($\hat{f}^* = f^*$) holds under Slater's condition: a strictly feasible point for non-affine constraints must exist. It always holds for LPs.
- KKT conditions are necessary (under strong duality) and sufficient (for convex programs) for optimality. Most optimization algorithms are methods for finding KKT-points.
- Complementary slackness ($\lambda_i^* g_i(x^*) = 0$) tells us inactive constraints have zero multipliers. This connects structure of the primal to structure of the dual.
- Theorems of alternatives use duality to certify infeasibility: either a system has a solution, or multipliers certifying its impossibility exist.
- The Dual SVM demonstrates practical power: duality reduces dimensionality and enables the kernel trick for non-linear classification.