This chapter introduces one of the most powerful ideas in optimization: duality. The core question is: given an optimization problem, can we construct another problem whose optimal value gives a bound on (or equals) the original? The answer is yes, and the machinery we build — the Lagrangian, dual functions, weak and strong duality, KKT conditions, and Slater's condition — forms the theoretical backbone for understanding, solving, and reasoning about convex programs.
3.1 Lagrange Dual Function and Dual Problem
The Big Idea: From Constraints to Penalties
Consider a general mathematical program with inequality and equality constraints:

$$\min_{x} \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0 \;\; (i \in [m]), \qquad h_j(x) = 0 \;\; (j \in [\ell]),$$

where $f: F \to \mathbb{R}$, $g_i: G_i \to \mathbb{R}$, and $h_j: H_j \to \mathbb{R}$.
The domain of the program is $P := F \cap \bigcap_{i=1}^m G_i \cap \bigcap_{j=1}^\ell H_j$ and the optimal value is $f^*$.
A key observation is that we can absorb the constraints into the objective using indicator functions:

$$I_{-}(u) := \begin{cases} 0 & \text{if } u \le 0, \\ +\infty & \text{if } u > 0, \end{cases} \qquad I_{0}(u) := \begin{cases} 0 & \text{if } u = 0, \\ +\infty & \text{if } u \ne 0. \end{cases}$$

This lets us rewrite the constrained problem as the unconstrained equivalent:

$$\min_{x \in P} \; f(x) + \sum_{i=1}^m I_{-}(g_i(x)) + \sum_{j=1}^\ell I_{0}(h_j(x)).$$
If $x$ satisfies all constraints, the indicators contribute zero. If it violates any constraint, the penalty is $+\infty$, so the point is effectively excluded. The problem is equivalent but the objective is nasty (non-convex, non-differentiable). The trick: replace these infinite-jump indicators with linear lower bounds.
For any $\lambda_i \ge 0$ and any $\nu_j \in \mathbb{R}$, we have the linear lower bounds:

$$\lambda_i u \le I_{-}(u) \qquad \text{and} \qquad \nu_j u \le I_{0}(u) \qquad \text{for all } u \in \mathbb{R}.$$
Replacing each indicator with its linear lower bound gives us something smaller than the true objective — in other words, a lower bound on $f^*$.
The Lagrangian $L: P \times \mathbb{R}^m \times \mathbb{R}^\ell \to \mathbb{R}$ is defined as:

$$L(x, \lambda, \nu) := f(x) + \sum_{i=1}^m \lambda_i g_i(x) + \sum_{j=1}^\ell \nu_j h_j(x).$$
The vectors $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^\ell$ are called Lagrange multipliers.
The Lagrange dual function $\hat{L}: \mathbb{R}^m \times \mathbb{R}^\ell \to \mathbb{R} \cup \{-\infty\}$ is:

$$\hat{L}(\lambda, \nu) := \inf_{x \in P} L(x, \lambda, \nu).$$
Think of it this way: for a fixed choice of multipliers $(\lambda, \nu)$, the dual function asks "what is the best (smallest) value the Lagrangian can take?" This infimum over $x$ eliminates the primal variable and gives us a function purely in terms of $(\lambda, \nu)$.
$\hat{L}(\lambda, \nu)$ is concave in $(\lambda, \nu)$, even if the original problem is not convex. This follows because $\hat{L}$ is a pointwise infimum of functions that are affine in $(\lambda, \nu)$, and the infimum of affine (hence concave) functions is concave.
The dual feasible region is $D := \{(\lambda, \nu) \in \mathbb{R}^m_{\ge 0} \times \mathbb{R}^\ell \mid \hat{L}(\lambda, \nu) > -\infty\}$. The Lagrange dual program is:

$$\max_{(\lambda, \nu)} \; \hat{L}(\lambda, \nu) \quad \text{s.t.} \quad (\lambda, \nu) \in D,$$

and its optimal value is denoted $\hat{f}^*$.
Example: Dual of a Linear Program
Consider the standard-form LP: $\min c^\top x$ subject to $Ax = b$, $x \ge 0$. With multipliers $\lambda \ge 0$ (for $-x \le 0$) and $\nu$ (for $Ax = b$), the Lagrangian is:

$$L(x, \lambda, \nu) = c^\top x - \lambda^\top x + \nu^\top (Ax - b) = (c - \lambda + A^\top \nu)^\top x - b^\top \nu.$$
Since this is linear in $x$, the infimum is $-\infty$ unless $c + A^\top \nu - \lambda = 0$, giving the dual:

$$\max_{\nu} \; -b^\top \nu \quad \text{s.t.} \quad A^\top \nu + c \ge 0,$$

where the multiplier $\lambda = c + A^\top \nu$ has been eliminated using $\lambda \ge 0$,
which is exactly the classical LP dual (after flipping sign conventions).
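As a quick numerical sanity check, the following sketch solves the primal and the dual $\max -b^\top\nu$ s.t. $A^\top\nu + c \ge 0$ with `scipy.optimize.linprog`, on a tiny LP chosen purely for illustration, and watches the two optimal values coincide:

```python
import numpy as np
from scipy.optimize import linprog

# Small standard-form LP: min c^T x  s.t.  Ax = b, x >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Primal problem.
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 2)

# Dual: max -b^T nu  s.t.  A^T nu + c >= 0  (nu free), written for
# linprog as: min b^T nu  s.t.  -A^T nu <= c.
dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=[(None, None)])

print(primal.fun, -dual.fun)  # both 1.0 (up to solver tolerance)
```

Both calls return optimal value $1$: strong duality for LPs, which Section 3.5 explains.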
Even cosmetic changes to the primal (adding a redundant variable, composing the objective with an increasing function) can change the dual problem. For instance, the unconstrained problem $\min f(Ax+b)$ has a trivial dual (just $f^*$), but rewriting it as $\min f(y)$ s.t. $Ax+b = y$ produces the more informative dual $\max\; b^\top\nu - f^*(\nu)$ subject to $A^\top \nu = 0$, where $f^*(\nu) := \sup_y (\nu^\top y - f(y))$ is the conjugate function (here $f^*$ denotes the conjugate, not the optimal value).
3.2 Weak Duality
For every $(\lambda, \nu) \in D$:

$$\hat{L}(\lambda, \nu) \le f^*.$$
In particular, $\hat{f}^* \le f^*$.
The proof is elegant. Let $\tilde{x}$ be any primal feasible point. Since $g_i(\tilde{x}) \le 0$ and $\lambda_i \ge 0$, and $h_j(\tilde{x}) = 0$, the penalty terms $\sum \lambda_i g_i(\tilde{x}) + \sum \nu_j h_j(\tilde{x})$ are non-positive. Therefore:

$$\hat{L}(\lambda, \nu) = \inf_{x \in P} L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f(\tilde{x}).$$
Since this holds for every feasible $\tilde{x}$, it holds for the best one, so $\hat{L}(\lambda,\nu) \le f^*$.
Even for non-convex problems, the dual is always a convex program (since $\hat{L}$ is concave). This means we can always compute a lower bound on $f^*$ by solving a convex problem. This is extremely valuable for hard combinatorial problems.
Consider the non-convex problem $\min x^\top W x$ subject to $x_i^2 = 1$ for all $i$ (so each $x_i \in \{-1, +1\}$). This is NP-hard for large $n$. But the Lagrangian gives:

$$L(x, \nu) = x^\top W x + \sum_{i=1}^n \nu_i (x_i^2 - 1) = x^\top \left( W + \text{diag}(\nu) \right) x - \mathbf{1}^\top \nu.$$
The infimum of this quadratic form is $0$ when $W + \text{diag}(\nu) \succeq 0$ and $-\infty$ otherwise, yielding the convex dual:

$$\max_{\nu} \; -\mathbf{1}^\top \nu \quad \text{s.t.} \quad W + \text{diag}(\nu) \succeq 0.$$
A simple feasible point is $\nu = -\lambda_{\min}(W) \cdot \mathbf{1}$, giving the lower bound $n \cdot \lambda_{\min}(W)$.
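To see the bound in action, the following sketch (with an arbitrarily chosen small $W$; the brute-force loop is only viable for tiny $n$) compares the dual bound $n \cdot \lambda_{\min}(W)$ against the true combinatorial optimum:

```python
import itertools
import numpy as np

# An arbitrarily chosen small instance (n = 3).
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
n = W.shape[0]

# Dual lower bound from the feasible point nu = -lambda_min(W) * 1.
bound = n * np.linalg.eigvalsh(W)[0]

# Exact optimum by brute force over all 2^n sign vectors.
best = min(np.array(s) @ W @ np.array(s)
           for s in itertools.product([-1.0, 1.0], repeat=n))

print(bound, best)  # approx -3.0 and -2.0: a valid (here non-tight) bound
```

The gap between $-3$ and $-2$ shows weak duality at work: the convex dual certifies a lower bound without solving the NP-hard problem, but the bound need not be tight.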
3.3 Strong Duality
The duality gap is the quantity $f^* - \hat{f}^* \ge 0$. We say strong duality holds when this gap is zero, i.e., $f^* = \hat{f}^*$.
Strong duality does not always hold (especially for non-convex problems), but when it does, we get a powerful tool: solving the dual is equivalent to solving the primal. Strong duality enables complementary slackness, the KKT conditions, and is the foundation for interior-point methods.
Complementary Slackness
Let $x^*$ and $(\lambda^*, \nu^*)$ be optimal primal and dual solutions with $f^* = \hat{f}^*$. Then for every $i \in [m]$:

$$\lambda_i^* \, g_i(x^*) = 0.$$
In words: at optimality, for each constraint, either the multiplier is zero (the constraint "doesn't matter") or the constraint is active/tight (the constraint is binding). You can never have both a positive multiplier and slack in the constraint simultaneously.
Think of $\lambda_i$ as the "price" of relaxing constraint $i$. If constraint $i$ has slack ($g_i(x^*) < 0$), then relaxing it wouldn't help — it's not a bottleneck — so its price $\lambda_i^*$ is zero. If $\lambda_i^* > 0$, the constraint must be tight: it's actively limiting the objective, so relaxing it would improve the optimum.
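A minimal sketch of this logic on a hypothetical one-variable problem chosen for illustration (the optimum and multipliers here are found by hand, not by a solver):

```python
# Illustrative problem: min (x - 2)^2
#   s.t.  g1(x) = x - 1 <= 0   (active at the optimum)
#         g2(x) = -x    <= 0   (slack at the optimum)
x_star = 1.0          # optimum, found by inspection
lam = [2.0, 0.0]      # from stationarity 2(x* - 2) + lam1 - lam2 = 0, lam2 = 0

g = [x_star - 1.0, -x_star]   # constraint values at x*: [0.0, -1.0]

# Complementary slackness: lambda_i * g_i(x*) = 0 for every i.
for lam_i, g_i in zip(lam, g):
    assert lam_i * g_i == 0.0

print("active constraint: positive price; slack constraint: price zero")
```

The binding constraint $x \le 1$ carries price $\lambda_1^* = 2$ (relaxing it would improve the optimum), while the slack constraint $x \ge 0$ has price zero.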
3.4 KKT Conditions
The Karush–Kuhn–Tucker (KKT) conditions capture what must hold at a pair of optimal primal and dual solutions when strong duality holds and the functions are differentiable. They unify feasibility, complementary slackness, and the gradient condition into a single system.
A point $(\tilde{x}, \tilde{\lambda}, \tilde{\nu})$ is a KKT-point if it satisfies:
- Primal feasibility: $g_i(\tilde{x}) \le 0$ for all $i$, and $h_j(\tilde{x}) = 0$ for all $j$.
- Dual feasibility: $\tilde{\lambda}_i \ge 0$ for all $i$.
- Complementary slackness: $\tilde{\lambda}_i g_i(\tilde{x}) = 0$ for all $i$.
- Stationarity: $\nabla f(\tilde{x}) + \sum_{i=1}^m \tilde{\lambda}_i \nabla g_i(\tilde{x}) + \sum_{j=1}^\ell \tilde{\nu}_j \nabla h_j(\tilde{x}) = 0.$
The stationarity condition says that at the optimum, the direction of steepest descent $-\nabla f(\tilde{x})$ is a non-negative combination of the gradients of the active inequality constraints plus an arbitrary linear combination of the equality constraint gradients. Geometrically: any direction that would decrease the objective is blocked by the active constraints.
For any mathematical program with differentiable $f$, $g_i$, $h_j$ in which strong duality holds, any pair $x^*, (\lambda^*, \nu^*)$ of primal and dual optimal solutions forms a KKT-point.
For any convex program with differentiable $f$, $g_i$, $h_j$: if $(\tilde{x}, \tilde{\lambda}, \tilde{\nu})$ is a KKT-point, then $\tilde{x}$ and $(\tilde{\lambda}, \tilde{\nu})$ are primal and dual optimal, and strong duality holds.
For convex programs, the KKT conditions are both necessary (when strong duality holds) and sufficient for optimality. Most practical algorithms for solving convex problems (e.g., interior-point methods) are essentially methods for finding a KKT-point.
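For the equality-constrained case, finding a KKT-point literally means solving a linear system. A sketch, assuming the toy QP $\min \frac{1}{2}\|x\|^2$ s.t. $Ax = b$ with an arbitrarily chosen $A$ and $b$:

```python
import numpy as np

# Toy QP: min (1/2) ||x||^2  s.t.  Ax = b.
# Stationarity (x + A^T nu = 0) plus primal feasibility (Ax = b) is
# one linear system in (x, nu): the KKT system.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
n, p = A.shape[1], A.shape[0]

K = np.block([[np.eye(n), A.T],
              [A, np.zeros((p, p))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), b]))
x, nu = sol[:n], sol[n:]

assert np.allclose(x + A.T @ nu, 0)   # stationarity
assert np.allclose(A @ x, b)          # primal feasibility
print(x)                              # approx [1/3, 2/3, 1/3]
```

With no inequality constraints, complementary slackness and dual feasibility are vacuous, so the KKT conditions reduce to exactly this block linear system.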
Consider $\min x$ subject to $x^2 \le 0$. The only feasible point is $x = 0$ with value $0$. The dual gives $\hat{L}(\lambda) = -1/(4\lambda)$ for $\lambda > 0$, approaching $0$ as $\lambda \to \infty$ but never reaching it. So strong duality holds ($f^* = \hat{f}^* = 0$) but the dual optimum is never attained, and there is no KKT-point. This happens because Slater's condition fails (you can't have $x^2 < 0$).
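We can watch this happen numerically: a sketch that minimizes the Lagrangian $x + \lambda x^2$ for growing $\lambda$ and checks the result against the closed form $-1/(4\lambda)$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# For min x s.t. x^2 <= 0, the dual function is
#   L_hat(lam) = inf_x (x + lam * x^2) = -1 / (4 * lam)   for lam > 0.
for lam in [1.0, 10.0, 100.0, 1000.0]:
    res = minimize_scalar(lambda x: x + lam * x ** 2)
    assert np.isclose(res.fun, -1.0 / (4.0 * lam))
    print(lam, res.fun)   # climbs toward 0 but never attains it
```

The printed dual values increase toward $\hat{f}^* = 0$ but remain strictly negative for every finite $\lambda$, illustrating strong duality without dual attainment.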
3.5 Strong Duality — Slater's Condition
Slater's condition is the most widely used constraint qualification — a sufficient condition guaranteeing that strong duality holds for convex programs.
For any convex program, if there exists a feasible point $\bar{x} \in \text{relint}(P)$ such that $g_i(\bar{x}) < 0$ for all non-affine $g_i$ (affine $g_i$ need only satisfy $g_i(\bar{x}) \le 0$) and $h_j(\bar{x}) = 0$ for all $j$, then:
- Strong duality holds: $f^* = \hat{f}^*$.
- The dual optimum is attained: there exist $\lambda^*, \nu^*$ achieving $\hat{f}^*$.
In plain terms: there must exist at least one "strictly feasible" point — a point that is in the interior of the feasible region (not just on the boundary). For affine constraints, being on the boundary is fine; only the non-linear inequality constraints need to be strictly satisfied.
Two important consequences deserve highlighting. First, since all constraints of an LP are affine, the refined Slater condition demands nothing beyond feasibility — so strong duality holds for every feasible linear program. Second, Slater's condition is sufficient but not necessary: the example $\min x$ s.t. $x^2 \le 0$ has strong duality but Slater's condition fails.
The second part of Slater's theorem — that the dual optimum is attained — is equally important in practice. It guarantees that we can actually find dual multipliers achieving the bound, which is needed for algorithms based on KKT conditions.
3.6 Geometric Interpretation
This section provides a beautiful visual way to understand weak duality, strong duality, and Slater's condition using a single geometric construction.
The Set $G$
For a mathematical program with one inequality constraint ($m = 1$, $\ell = 0$ for simplicity), define the set of achievable constraint-value and objective-value pairs:

$$G := \{\, (g_1(x), f(x)) \mid x \in P \,\} \subseteq \mathbb{R}^2.$$
In the $(u, t)$-plane, the primal optimum $f^*$ is the lowest point on the $t$-axis that the set $G$ achieves when $u \le 0$:

$$f^* = \inf \{\, t \mid (u, t) \in G, \; u \le 0 \,\}.$$
Dual Values as Supporting Hyperplanes
For a given $\lambda \ge 0$, the dual value $\hat{L}(\lambda)$ equals the minimum of $\lambda u + t$ over $G$. Geometrically, the line $\lambda u + t = \hat{L}(\lambda)$ is a non-vertical supporting hyperplane of $G$. Its intersection with the $t$-axis (at $u = 0$) gives $\hat{L}(\lambda)$.
Picture the set $G$ in the $(u, t)$-plane. The primal value $f^*$ is found by looking at where $G$ meets (or approaches) the vertical line $u = 0$ from the left. The dual value $\hat{L}(\lambda)$ is where the supporting line with slope $-\lambda$ meets the $t$-axis. The duality gap is the vertical distance between $f^*$ and $\hat{f}^*$ on the $t$-axis.
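This picture can be sampled numerically. A sketch on the illustrative problem $\min x$ s.t. $x^2 - 1 \le 0$ (so $f^* = -1$ at $x = -1$), approximating $G$ on a grid:

```python
import numpy as np

# Illustrative problem: f(x) = x,  g(x) = x^2 - 1 <= 0.
xs = np.linspace(-3.0, 3.0, 2001)
u, t = xs ** 2 - 1.0, xs          # sampled points of G = {(g(x), f(x))}

f_star = t[u <= 0].min()          # lowest point of G on or left of u = 0

# The line lam*u + t = L_hat(lam) supports G; its t-axis intercept is L_hat.
for lam in [0.25, 0.5, 1.0]:
    L_hat = (lam * u + t).min()
    assert L_hat <= f_star + 1e-9     # weak duality, read off the picture

# At lam = 0.5 the supporting line passes through (0, f*): strong duality.
print(f_star, (0.5 * u + t).min())
```

Here $\hat{L}(\lambda) = -1/(4\lambda) - \lambda$, maximized at $\lambda = 1/2$ with value $-1 = f^*$: the supporting line of slope $-1/2$ touches $G$ exactly at height $f^*$ on the $t$-axis.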
The Epigraph Set $A$
Extending $G$ by allowing worse constraint and objective values gives:

$$A := \{\, (u, t) \mid \exists x \in P : \; g_1(x) \le u, \; f(x) \le t \,\} = G + \mathbb{R}^2_{\ge 0}.$$
This set is always convex for convex programs (since $g_1$ and $f$ are convex). Now $f^* = \inf\{t : (0, t) \in A\}$ and the dual values are still given by supporting hyperplanes of $A$.
Strong duality holds if and only if $A$ has a non-vertical supporting hyperplane at the point $(0, f^*)$. Slater's condition guarantees this by ensuring that $A$ contains points with $u < 0$, which rules out a vertical supporting hyperplane at $(0, f^*)$.
This geometric picture provides an elegant alternative proof of weak duality. Since $(0, f^*) \in A$, for any supporting hyperplane $(\lambda, 1)^\top(u, t) \ge \hat{L}(\lambda)$ we get $\hat{L}(\lambda) \le ({\lambda}, 1)^\top(0, f^*) = f^*$.
3.7 Theorems of Alternatives
Duality theory is not only about optimization — it can also prove that systems of inequalities are infeasible. The idea: if a system has no solution, there must be a "certificate of infeasibility" in the form of dual multipliers.
Weak Alternatives
Two systems are weak alternatives if at most one of them is feasible.
Consider the primal system $g_i(x) \le 0$, $h_j(x) = 0$ versus the dual system $\hat{L}(\lambda, \nu) > 0$, $\lambda \ge 0$. By weak duality, at most one of these can be feasible: if the primal is feasible then $f^* = 0$ (the objective is zero) so $\hat{f}^* \le 0$, ruling out $\hat{L} > 0$.
Strong Alternatives
Two systems are strong alternatives if exactly one of them is feasible.
When the constraints $g_i$ are convex and $h_j$ are affine, and a mild regularity condition holds (a point in $\text{relint}(P)$ satisfying the equalities exists), we upgrade from weak to strong alternatives. For instance, for strict inequalities:
Under a regularity condition, exactly one of the following two systems is feasible:

$$\text{(I)} \quad g_i(x) < 0 \;\; (i \in [m]), \;\; h_j(x) = 0 \;\; (j \in [\ell]); \qquad \text{(II)} \quad \lambda \ge 0, \; \lambda \ne 0, \; \hat{L}(\lambda, \nu) \ge 0.$$
Either you can strictly satisfy all constraints (the system is "really" feasible), or there exist dual multipliers certifying that the constraints are incompatible. There's no middle ground — this is the convex analogue of Farkas' Lemma from linear programming.
3.8 Applications — The Dual SVM
One of the most impactful applications of duality is the dual formulation of the Support Vector Machine (SVM). Recall from Chapter 2 that the (primal) SVM seeks the maximum-margin separating hyperplane:

$$\min_{w, b} \; \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 \quad (i \in [m]).$$
Deriving the Dual
Introducing multipliers $\lambda_i \ge 0$ for each constraint and forming the Lagrangian:

$$L(w, b, \lambda) = \|w\|^2 + \sum_{i=1}^m \lambda_i \left( 1 - y_i (w^\top x_i + b) \right).$$
Setting partial derivatives to zero gives two conditions: $w = \frac{1}{2}\sum_i \lambda_i y_i x_i$ and $\sum_i \lambda_i y_i = 0$. Substituting back yields the dual:

$$\max_{\lambda \ge 0} \; \sum_{i=1}^m \lambda_i - \frac{1}{4} \sum_{i=1}^m \sum_{j=1}^m \lambda_i \lambda_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \sum_{i=1}^m \lambda_i y_i = 0.$$
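As a sanity check, this dual can be solved directly on a hypothetical two-point dataset (the data and the use of SLSQP are illustration choices, not part of the derivation):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-point dataset, one point per class.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Negated dual objective: sum_i lam_i - (1/4) lam^T Q lam.
    return -(lam.sum() - 0.25 * lam @ Q @ lam)

res = minimize(neg_dual, x0=np.ones(2), method="SLSQP",
               bounds=[(0, None)] * 2,
               constraints={"type": "eq", "fun": lambda lam: lam @ y})

lam = res.x
w = 0.5 * (lam * y) @ X          # recover w = (1/2) sum_i lam_i y_i x_i
print(lam, w)                    # lam ~ [1, 1], w ~ [1, 0]
```

The recovered hyperplane $w = (1, 0)$, $b = 0$ separates the two points with maximum margin, and the dual optimal value equals the primal value $\|w\|^2 = 1$: strong duality (Slater's condition holds here).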
Why the Dual Formulation is Powerful
The dual has two major advantages over the primal:
- Dimension independence: The primal has $n + 1$ variables (dimension of feature space), while the dual has $m$ variables (number of data points). In high-dimensional settings like image classification where $n \gg m$, the dual is much smaller.
- The kernel trick: The dual and the classification rule depend only on inner products $x_i^\top x_j$ and $x_i^\top z$. This means we can replace these with a kernel function $k(x, y) = \Phi(x) \cdot \Phi(y)$ for some feature map $\Phi$ without ever computing $\Phi$ explicitly. For example, the Gaussian kernel $k(x, y) = e^{-\|x-y\|^2 / 2\sigma^2}$ implicitly maps into an infinite-dimensional space.
Consider the feature map $\Phi: (x_1, x_2)^\top \mapsto (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)^\top$. Its kernel is simply $k(x, y) = (x^\top y)^2$. So instead of working in 3D, we stay in 2D and use this kernel in the dual — the math handles the "lifting" for us automatically.
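The claimed identity is easy to verify numerically — a sketch checking $\Phi(x)^\top \Phi(y) = (x^\top y)^2$ on random inputs:

```python
import numpy as np

def phi(x):
    # Explicit feature map (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def k(x, y):
    # Claimed equivalent kernel, evaluated entirely in 2D.
    return (x @ y) ** 2

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    assert np.isclose(phi(x) @ phi(y), k(x, y))
print("phi(x) . phi(y) == (x . y)^2 on all samples")
```

Expanding $\Phi(x)^\top \Phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (x_1 y_1 + x_2 y_2)^2$ confirms the identity algebraically as well.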
Key Takeaways
- The Lagrangian replaces hard indicator-function penalties with soft linear penalties parametrized by multipliers $\lambda, \nu$. The dual function $\hat{L}(\lambda, \nu) = \inf_x L(x, \lambda, \nu)$ is always concave.
- Weak duality ($\hat{f}^* \le f^*$) always holds, giving a lower bound on the primal. This makes the dual valuable even for non-convex problems.
- Strong duality ($\hat{f}^* = f^*$) holds under Slater's condition: a strictly feasible point for non-affine constraints must exist. It always holds for LPs.
- KKT conditions are necessary (under strong duality) and sufficient (for convex programs) for optimality. Most optimization algorithms are methods for finding KKT-points.
- Complementary slackness ($\lambda_i^* g_i(x^*) = 0$) tells us inactive constraints have zero multipliers. This connects structure of the primal to structure of the dual.
- Theorems of alternatives use duality to certify infeasibility: either a system has a solution, or multipliers certifying its impossibility exist.
- The Dual SVM demonstrates practical power: duality reduces dimensionality and enables the kernel trick for non-linear classification.