This chapter is the heart of the first part of the course. It formally defines convex optimization problems — the central objects we will study — and shows why they occupy a sweet spot between generality and tractability. The chapter moves from basic definitions, through key structural theorems, to conic programs, quasiconvex extensions, and real-world applications like Support Vector Machines. Along the way, it builds intuition for recognizing convex problems, reformulating non-obvious ones into convex form, and understanding the landscape of special cases (LP, QP, SOCP, SDP) that form a hierarchy of increasing expressive power.
2.1 Introduction — Mathematical & Convex Optimization Problems
The General Mathematical Program
Before we can define convex programs, we need the broadest template. A mathematical optimization problem is simply the task of finding an input $x \in \mathbb{R}^n$ that makes some objective function as small as possible while respecting constraints.
Given functions $f : \mathbb{R}^n \to \mathbb{R}$ (the objective) and $g_i : \mathbb{R}^n \to \mathbb{R}$ for $i \in [m]$ (the constraints), a mathematical optimization problem is:

$$\min_{x \in \mathbb{R}^n} \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0 \ \text{ for all } i \in [m].$$
Think of this as a recipe: you have a quantity you want to minimize (cost, error, risk) and a set of rules you cannot violate (budget limits, physics, capacity). Linear programming is a special case where both the objective and constraints are affine (i.e., $f(x) = c^\top x$ and $g_i(x) = a_i^\top x - b_i$).
The Convex Optimization Problem
A convex program adds one powerful requirement: both the objective and all constraint functions must be convex. This seemingly simple restriction has profound algorithmic consequences.
Let $f, g_1, \dots, g_m : \mathbb{R}^n \to \mathbb{R}$ be convex functions. The convex optimization problem in general form is:

$$\min_{x \in \mathbb{R}^n} \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0 \ \text{ for all } i \in [m].$$
The feasible region is an intersection of convex sublevel sets $\{x : g_i(x) \le 0\}$, hence convex. Equality constraints of the form $h_j(x) = 0$ are allowed only if $h_j$ is affine (since an affine function is both convex and concave, the set $\{x : h_j(x) = 0\}$ is convex).
Convexity guarantees that the landscape has no "traps": there are no valleys that look optimal locally but are suboptimal globally. This is the key property that makes these problems efficiently solvable.
Theorem: Local = Global Optimality
This is arguably the most important theorem about convex programs. It says: if you've found a local minimum, you've actually found the global minimum.
Let $f : S \to \mathbb{R}$ be convex over a convex set $S$. Every local minimum of $f$ over $S$ is also a global minimum.
Suppose $x$ is a local minimum but not global, so there exists some $y \in S$ with $f(y) < f(x)$. By convexity, every point on the segment from $x$ to $y$ satisfies $f(\lambda y + (1-\lambda)x) \le \lambda f(y) + (1-\lambda) f(x) < f(x)$ for $\lambda \in (0,1]$. Taking $\lambda$ arbitrarily small gives points arbitrarily close to $x$ with strictly smaller objective value, contradicting local minimality of $x$.
A natural follow-up is that the set of optimal solutions itself is convex (Theorem 2.1.4). If both $x$ and $y$ are optimal, then any convex combination $\lambda x + (1-\lambda)y$ is also optimal — you never "jump out" of the optimal set.
The Hierarchy of Special Cases
Convex programs encompass a rich hierarchy of problem classes, each more expressive than the last. Later chapters treat each in detail.
| Class | Objective | Constraints |
|---|---|---|
| LP (Linear Program) | affine | affine |
| QP (Quadratic Program) | convex quadratic | affine |
| QCQP | convex quadratic | convex quadratic |
| SOCP (Second-Order Cone) | affine | $\lVert A_i x + b_i \rVert_2 \le c_i^\top x + d_i$ |
| SDP (Semidefinite Program) | affine | linear matrix inequalities |
Each class strictly contains the previous one: LP $\subset$ QP $\subset$ QCQP $\subset$ SOCP $\subset$ SDP $\subset$ general convex programs. This nesting is one of the organizing themes of the entire course.
Abstract Convex Optimization Problems
Sometimes a problem is convex (the feasible set is convex and we're minimizing a convex function), but the constraints aren't written with convex functions. This is called an abstract convex optimization problem. Making it into a true convex program requires rewriting the constraints.
Consider minimizing $x_1^2 + x_2^2$ subject to $x_1/(1+x_2^2)^2 \le 0$ and $(x_1+x_2)^2 = 0$. The constraint $(x_1+x_2)^2 = 0$ is not affine, and the function $x_1/(1+x_2^2)^2$ is not convex. But $(x_1+x_2)^2 = 0$ holds exactly when $x_1 + x_2 = 0$ (which is affine), and $x_1/(1+x_2^2)^2 \le 0$ holds exactly when $x_1 \le 0$ (the denominator is always positive). The rewritten problem is a true convex program.
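To check the reformulation numerically, the rewritten program can be handed to an off-the-shelf solver. A minimal sketch using SciPy's `minimize` (the solver choice and starting point are illustrative assumptions, not from the text):

```python
import numpy as np
from scipy.optimize import minimize

# Rewritten convex program: minimize x1^2 + x2^2
# subject to x1 <= 0 (convex) and x1 + x2 = 0 (affine)
objective = lambda x: x[0] ** 2 + x[1] ** 2
constraints = [
    {"type": "ineq", "fun": lambda x: -x[0]},       # x1 <= 0; SLSQP wants fun(x) >= 0
    {"type": "eq",   "fun": lambda x: x[0] + x[1]},  # x1 + x2 = 0
]
res = minimize(objective, x0=np.array([-1.0, 1.0]), constraints=constraints)
```

The unique optimum is the origin: $x_1 = -x_2$ and $x_1 \le 0$ still allow $x = 0$, which minimizes the objective globally.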
Hidden Convex Problems & the Trust Region Subproblem
Even more surprising: some problems that don't look convex at all can be transformed into convex ones. These are called hidden convex problems. A key example is the Trust Region Subproblem (TRS): minimizing a (possibly non-convex) quadratic function over a ball.
Given a symmetric matrix $A$ (not necessarily PSD), a vector $b$, and a scalar $c$, the TRS is:

$$\min_{x \in \mathbb{R}^n} \; x^\top A x + 2\, b^\top x + c \quad \text{s.t.} \quad \|x\|_2 \le 1.$$
Even though $A$ may have negative eigenvalues (making the objective non-convex), a clever variable substitution using the eigendecomposition $A = UDU^\top$ transforms this into a convex program.
Diagonalize $A$ to decouple the variables: with $y = U^\top x$ and $e = U^\top b$, the objective becomes $\sum_i d_i y_i^2 + 2 e_i y_i$. Then use Theorem 2.1.8 (which shows that at optimality $e_i y^*_i \le 0$) to deduce the sign pattern of the optimal solution. This allows a substitution $y_i = -\text{sgn}(e_i)\sqrt{z_i}$ that turns the non-convex quadratic into a convex function of $z \ge 0$.
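A numerical sketch of this pipeline, assuming the TRS objective $x^\top A x + 2b^\top x$ over the unit ball; the particular $A$, $b$ and the use of SciPy's SLSQP are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Indefinite A: the objective x^T A x + 2 b^T x is non-convex on the unit ball
A = np.array([[1.0, 2.0], [2.0, -3.0]])
b = np.array([0.5, 1.0])

d, U = np.linalg.eigh(A)   # A = U diag(d) U^T; d has a negative entry
e = U.T @ b                # linear term after the change of variables y = U^T x

# Convex reformulation in z (z_i plays the role of y_i^2): the d_i z_i terms are
# now linear, and -2|e_i| sqrt(z_i) is convex, so phi is convex even though A is not PSD
phi = lambda z: np.sum(d * z - 2.0 * np.abs(e) * np.sqrt(z))
res = minimize(phi, x0=np.array([0.4, 0.4]),
               bounds=[(1e-12, None)] * 2,
               constraints=[{"type": "ineq", "fun": lambda z: 1.0 - np.sum(z)}])

# Recover the TRS solution from the sign pattern: y_i = -sgn(e_i) sqrt(z_i)
y = -np.sign(e) * np.sqrt(res.x)
x = U @ y
```

A brute-force scan of the unit disk confirms that the convex reformulation recovers the global minimum of the non-convex original.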
Equality Constraints with Convex Functions
Normally, equality constraints must be affine for the problem to be convex. However, Proposition 2.1.9 shows a special case where you can replace a convex equality $h(x) = 0$ with the inequality $h(x) \le 0$, provided the optimal solution naturally satisfies $h(x^*) = 0$. The two programs then have the same optimal value.
Algorithms — A Brief Note
Convex programs can be solved in polynomial time (in the input size and in $\log(1/\varepsilon)$ for target accuracy $\varepsilon$) using the interior point method, as long as the functions are twice continuously differentiable and the problem is strictly feasible (Slater's condition: there exists an $x$ with $g_i(x) < 0$ for all $i$). The idea is to encode the feasible set using barrier functions and solve a sequence of unconstrained smooth problems with Newton's method, gradually sharpening the barrier.
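The following one-dimensional sketch illustrates the barrier idea (the toy problem, parameters, and stopping rule are my own, not from the chapter): the interval constraints $0 \le x \le 1$ are folded into the logarithmic barrier $-\log(x) - \log(1-x)$, and Newton's method re-centers after each increase of the barrier parameter $t$.

```python
def barrier_solve(fp, fpp, lo, hi, t0=1.0, mu=10.0, tol=1e-8):
    """Log-barrier interior point sketch for: minimize f(x) over lo <= x <= hi.

    fp/fpp are the first/second derivatives of the (convex, smooth) objective.
    Each outer iteration re-centers via damped Newton, then increases t; after
    centering, the suboptimality is at most m/t with m = 2 inequality constraints.
    """
    x, t, m = (lo + hi) / 2.0, t0, 2
    while m / t > tol:
        for _ in range(50):  # Newton steps on t*f(x) - log(x - lo) - log(hi - x)
            g = t * fp(x) - 1.0 / (x - lo) + 1.0 / (hi - x)
            h = t * fpp(x) + 1.0 / (x - lo) ** 2 + 1.0 / (hi - x) ** 2
            step = -g / h
            while not (lo < x + step < hi):   # damp to stay strictly feasible
                step *= 0.5
            x += step
            if g * g / h < 1e-18:             # Newton decrement tiny: centered
                break
        t *= mu
    return x

# minimize (x - 3)^2 over [0, 1]; the optimum sits on the boundary at x = 1
x_star = barrier_solve(fp=lambda x: 2.0 * (x - 3.0), fpp=lambda x: 2.0, lo=0.0, hi=1.0)
```

The iterates stay strictly inside $(0,1)$ (hence "interior point") and approach the boundary optimum only as $t$ grows.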
2.2 Conic Optimization Problems
Conic programs are a particularly clean and general way to express convex optimization problems. The idea is elegant: instead of writing constraints as $g_i(x) \le 0$ with various convex functions, we require $x$ to belong to a proper cone $K$ (a cone that is convex, closed, solid, and pointed) and optimize a linear objective over the intersection of that cone with an affine subspace.
Let $K \subseteq \mathbb{R}^n$ be a proper cone. A conic program in standard form is:

$$\min_{x \in \mathbb{R}^n} \; c^\top x \quad \text{s.t.} \quad Ax = b, \; x \in K.$$
The feasible set is the intersection of the cone $K$ with the affine subspace $\{x : Ax = b\}$. For an LP, $K$ is the nonnegative orthant $\mathbb{R}^n_+$. For an SOCP, $K$ is a product of second-order (Lorentz) cones. For an SDP, $K$ is the positive semidefinite cone $\mathcal{S}^n_+$. Same template, different cones — that's the power of conic programming.
Generalized Conic Programs on Abstract Vector Spaces
The conic framework extends beyond $\mathbb{R}^n$. Given two real inner product vector spaces $S, T$, a linear map $\mathcal{A}: S \to T$, and a proper cone $K \subseteq S$, we can define a conic program and its dual in full generality. The dual uses the dual cone $K^*$ and the adjoint map $\mathcal{A}^*$.
If $x$ is primal feasible and $y$ is dual feasible, then $\langle c, x \rangle_S \ge \langle y, b \rangle_T$. The proof is short and elegant: use primal feasibility ($\mathcal{A}x = b$), the definition of the adjoint, and the fact that $c - \mathcal{A}^* y \in K^*$ and $x \in K$ means their inner product is nonnegative (by definition of dual cone).
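Spelled out, the weak duality argument is a one-line chain:

```latex
\langle c, x\rangle_S - \langle y, b\rangle_T
  = \langle c, x\rangle_S - \langle y, \mathcal{A}x\rangle_T     % primal feasibility: Ax = b
  = \langle c, x\rangle_S - \langle \mathcal{A}^* y, x\rangle_S  % definition of the adjoint
  = \langle c - \mathcal{A}^* y,\, x\rangle_S
  \;\ge\; 0                                                      % x in K and c - A*y in K*
```

with the last step by the definition of the dual cone $K^*$.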
Unlike linear programs, strong duality does not always hold for conic programs. Later chapters will show examples of SDPs where both primal and dual are feasible but have different optimal values. Additional conditions (like Slater's condition) are needed to guarantee zero duality gap.
2.3 Miscellaneous — Extensions and Boundaries of Convexity
2.3.1 Quasiconvex Programs
Not every useful optimization problem has a convex objective. A natural relaxation is quasiconvexity: instead of requiring the function value on any convex combination to be at most the weighted average of the endpoints (convexity), we only require it to be at most the maximum of the two endpoint values.
A function $f : X \to \mathbb{R}$ on a convex set $X$ is quasiconvex if for all $x, y \in X$ and $\lambda \in [0,1]$:

$$f(\lambda x + (1-\lambda) y) \le \max\{f(x), f(y)\}.$$
Equivalently: all sublevel sets $\{x : f(x) \le \alpha\}$ are convex.
A convex function's graph "curves upward." A quasiconvex function doesn't need to curve upward — it can be flat or even curve downward — as long as it doesn't create multiple separated "valleys." For instance, the ratio $p(x)/q(x)$ of a convex function $p \ge 0$ and a concave function $q > 0$ is quasiconvex.
The key algorithmic tool for quasiconvex programs is the bisection method. The insight: to check whether the optimal value is $\le t$, we solve the convex feasibility problem $\phi_t(x) \le 0,\; g_i(x) \le 0$, where $\phi_t$ is a convex function whose 0-sublevel set equals the $t$-sublevel set of $f$. By binary searching on $t$, we reduce a quasiconvex program to $O(\log(1/\varepsilon))$ convex feasibility problems.
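A minimal sketch of the bisection loop, using a one-dimensional linear-fractional example of my own: minimize $(2x+3)/(x+1)$ over $[0,4]$, where the denominator is positive. The sublevel test $f(x) \le t$ becomes the linear feasibility condition $(2-t)x \le t-3$, checked here by minimizing the linear left-hand side over the interval.

```python
def bisect_quasiconvex(feasible, lo, hi, iters=60):
    """Binary search on t. feasible(t) reports whether the t-sublevel set of f
    meets the constraint set (a convex feasibility problem when f is quasiconvex)."""
    for _ in range(iters):
        t = (lo + hi) / 2.0
        if feasible(t):
            hi = t   # optimal value is <= t: shrink from above
        else:
            lo = t   # optimal value is  > t: shrink from below
    return hi

def feasible(t):
    # f(x) <= t  <=>  (2 - t) x <= t - 3 on [0, 4]; check the minimum of the LHS
    lhs_min = min((2.0 - t) * 0.0, (2.0 - t) * 4.0)
    return lhs_min <= t - 3.0

opt_val = bisect_quasiconvex(feasible, lo=0.0, hi=3.0)  # f(0) = 3 bounds the optimum
```

Here $f(x) = 2 + 1/(x+1)$ is decreasing, so the true optimum is $f(4) = 2.2$; sixty halvings of $[0,3]$ pin it down far below any practical tolerance.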
2.3.2 Maximizing a Convex Function
While minimizing a convex function over a convex set is tractable, maximizing one is in general hard. A convex function achieves its maximum at an extreme point of the domain. If the domain is a polytope with a vertex representation $P = \text{conv}\{p_1, \dots, p_K\}$, you simply evaluate at all vertices. But if you only have a halfspace representation, the number of vertices can be exponential, making the problem intractable in general.
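When the vertex representation is available, maximization really is just enumeration. A small sketch over the cube $[-1,1]^3$ (the objective is an example of my own):

```python
import itertools

# Maximize a convex f over the cube [-1, 1]^3: a convex function attains its
# maximum at an extreme point, so it suffices to scan the 8 vertices.
f = lambda x: (x[0] - 0.5) ** 2 + x[1] ** 2 + 2.0 * x[2] ** 2
best_val, best_vertex = max(
    (f(v), v) for v in itertools.product([-1.0, 1.0], repeat=3)
)
```

With a halfspace description instead, no such shortcut exists: a cube in $\mathbb{R}^n$ has only $2n$ facets but $2^n$ vertices.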
2.3.3 Equality Constraint Elimination
Sometimes it is useful to remove equality constraints from a convex program. If we have $Ax = b$, we can parametrize the affine subspace as $x = x_0 + Nz$, where $x_0$ is a particular solution and $N$ spans the null space of $A$. Substituting into the objective and inequality constraints gives an equivalent problem with no equalities. In practice this isn't always efficient — it may destroy sparsity in $A$ — but it's a useful theoretical tool.
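A sketch of the elimination with NumPy/SciPy (the particular objective and constraint are illustrative assumptions): project out the equality $x_1 + x_2 + x_3 = 1$ and minimize the reduced problem without constraints.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import minimize

A = np.array([[1.0, 1.0, 1.0]])            # one equality constraint: Ax = b
b = np.array([1.0])

x0 = np.linalg.lstsq(A, b, rcond=None)[0]  # a particular solution of Ax = b
N = null_space(A)                          # columns of N span null(A)

target = np.array([1.0, 2.0, 3.0])
f = lambda x: np.sum((x - target) ** 2)    # convex objective: squared distance

# substitute x = x0 + N z: an unconstrained convex problem in z
res = minimize(lambda z: f(x0 + N @ z), np.zeros(N.shape[1]))
x = x0 + N @ res.x
```

The recovered $x$ is the Euclidean projection of `target` onto the plane, namely $(-2/3,\, 1/3,\, 4/3)$, and it satisfies the equality exactly by construction.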
2.4 Applications — Support Vector Machines
The main application discussed in this chapter is the Support Vector Machine (SVM), a classic supervised machine learning method for binary classification. The SVM story unfolds in several stages, each illustrating convex optimization ideas.
Stage 1: The Linear Classification Problem
Given $m$ points $\{x_1, \dots, x_m\} \subset \mathbb{R}^n$ with labels $y_i \in \{-1, +1\}$, find a hyperplane $w^\top x + b = 0$ that separates the two classes. This is a linear feasibility problem:

find $w \in \mathbb{R}^n,\; b \in \mathbb{R}$ such that $y_i (w^\top x_i + b) > 0$ for all $i \in [m]$.
If the points are linearly separable, there are typically many feasible hyperplanes.
Stage 2: Maximizing the Margin
Among all separating hyperplanes, which one is "best"? The SVM selects the one with the maximum margin — the largest gap between the hyperplane and the nearest data points (called support vectors). This makes the classifier robust to small perturbations.
For the pair $(w, b)$ normalized so that the support vectors satisfy $w^\top x + b = \pm 1$, the margin width is $2/\|w\|$. Therefore, maximizing the margin is equivalent to minimizing $\|w\|^2$, giving us the quadratic program:

$$\min_{w, b} \; \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 \ \text{ for all } i \in [m].$$
The two supporting hyperplanes are $w^\top x + b = 1$ and $w^\top x + b = -1$. The distance between two parallel hyperplanes $w^\top x = c_1$ and $w^\top x = c_2$ is $|c_1 - c_2|/\|w\|$. Here $c_1 - c_2 = 2$, so the margin width is $2/\|w\|$.
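The max-margin QP is small enough to hand to a generic solver. A sketch on a toy separable dataset (the data, solver, and starting point are my own choices, not from the text):

```python
import numpy as np
from scipy.optimize import minimize

# toy 2-D dataset, linearly separable
X = np.array([[0.0, 0.0], [0.5, -0.5], [2.0, 2.0], [2.5, 1.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# variables v = (w1, w2, b); minimize ||w||^2 s.t. y_i (w . x_i + b) >= 1
objective = lambda v: v[0] ** 2 + v[1] ** 2
margins = [{"type": "ineq", "fun": (lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0)}
           for i in range(len(y))]
res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), constraints=margins)
w, b = res.x[:2], res.x[2]
```

For this dataset the closest points across the classes are $(0,0)$ and $(2,2)$ at distance $2\sqrt{2}$, so the maximum margin is $\sqrt{2} = 1/\|w\|$ and the solver should return $\|w\|^2 \approx 1/2$.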
Stage 3: Non-linear Classification via Feature Maps
Real data is often not linearly separable. The trick is to apply a feature map $\Phi : \mathbb{R}^n \to \mathbb{R}^{n'}$ that lifts the data into a higher-dimensional space where it becomes linearly separable, then run SVM there.
Consider points on the real line: $\{-2, -1, 1, 2\}$ with labels $\{-1, +1, +1, -1\}$. No line separates them. But mapping $x \mapsto (x, x^2)$ lifts them into 2D, where they become linearly separable — points with small $|x|$ land lower, and points with large $|x|$ land higher.
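The lifted example is easy to verify in a few lines (the threshold $2.5$ for the separating line in the lifted space is my choice; any value strictly between $1$ and $4$ works):

```python
# points and labels from the text: no threshold on x alone separates them
points = [-2.0, -1.0, 1.0, 2.0]
labels = [-1, +1, +1, -1]

phi = lambda x: (x, x * x)   # feature map lifting the line into 2-D

# in the lifted space the horizontal line (second coordinate = 2.5) separates:
# x^2 takes the values 4, 1, 1, 4, so the small-|x| points fall below it
predict = lambda x: +1 if phi(x)[1] < 2.5 else -1
predictions = [predict(x) for x in points]
```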
The challenge is that $\Phi$ might map into a very high-dimensional (even infinite-dimensional) space. The kernel trick, developed in Chapter 3 via duality, addresses this: the dual SVM formulation depends only on inner products $\Phi(x_i)^\top \Phi(x_j)$, which can often be computed directly via a kernel function $k(x, y) = \Phi(x)^\top \Phi(y)$ without ever computing $\Phi$ explicitly.
2.6 Supplement — Polynomial Optimization
This supplementary section gives a taste of problems beyond convex optimization. Polynomial optimization problems allow the objective and constraints to be polynomials — which are generally non-convex.
For a finite set of polynomials $\mathcal{G} \subseteq \mathbb{R}[x]$, the set $\mathcal{G}_+ = \{x \in \mathbb{R}^n : g(x) \ge 0,\; \forall g \in \mathcal{G}\}$ is a basic closed semialgebraic set. These sets are not necessarily convex, and their description is not unique — adding a globally nonnegative polynomial to $\mathcal{G}$ doesn't change $\mathcal{G}_+$.
Polynomial optimization is NP-hard in general. This can be shown by encoding the NP-hard Maximum Independent Set problem: for each vertex introduce a variable $x_i \in \{0, 1\}$, enforce the binary constraint via $x_i^2 - x_i = 0$ (a polynomial equality), and maximize $\sum x_i$ subject to edge constraints. This is a polynomial program.
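On a small graph this encoding can be checked by brute force. A sketch (the 4-cycle graph and the edge constraint $x_i x_j = 0$, a standard way to forbid adjacent vertices, are illustrative choices):

```python
import itertools

# 4-cycle: the maximum independent set has size 2 (two opposite vertices)
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

best = 0
for x in itertools.product([0, 1], repeat=n):     # x_i^2 - x_i = 0 forces x_i in {0, 1}
    if all(x[i] * x[j] == 0 for i, j in edges):   # polynomial edge constraint x_i x_j = 0
        best = max(best, sum(x))                  # objective: maximize sum_i x_i
```

The exhaustive loop takes $2^n$ steps, which is exactly the blow-up the NP-hardness reduction exploits.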
The role of convex optimization in this setting is to provide convex relaxations — tractable approximations to NP-hard polynomial problems. These relaxation hierarchies (such as the SOS/Lasserre hierarchy, studied later) provide increasingly tight lower bounds on the true optimal value.
The Big Picture — Where Chapter 2 Fits
Key Takeaways
- Convex programs are the sweet spot — they are general enough to model a huge range of real-world problems, yet tractable enough to solve in polynomial time via interior point methods.
- Local = Global: The foundational property of convex optimization. Any local minimum is global, and the set of all optimal solutions is itself convex.
- Conic programs unify everything: LP, SOCP, and SDP are all conic programs with different choices of cone. The standard form $\min\, c^\top x$ s.t. $Ax = b,\, x \in K$ gives a clean geometric picture: optimizing over the intersection of a cone and an affine subspace.
- Weak duality holds for all conic programs, but strong duality requires additional conditions (like Slater's condition) — unlike LP where strong duality is "free."
- Recognizing hidden convexity is a powerful skill. Problems like the Trust Region Subproblem appear non-convex but can be reformulated as convex programs via clever transformations.
- Quasiconvex programs extend the convex framework: they can be solved via bisection on a family of convex feasibility problems.
- SVM is a canonical application — a quadratic program whose dual (Chapter 3) reveals the kernel trick for non-linear classification.
- Polynomial optimization is NP-hard in general, but convex relaxations provide a principled approach to approximate solutions.