Chapter 2: Convex Optimization Problems

Convex Optimization — Through the Lens of Algorithms and Applications: Lecture Summary

This chapter is the heart of the first part of the course. It formally defines convex optimization problems — the central objects we will study — and shows why they occupy a sweet spot between generality and tractability. The chapter moves from basic definitions, through key structural theorems, to conic programs, quasiconvex extensions, and real-world applications like Support Vector Machines. Along the way, it builds intuition for recognizing convex problems, reformulating non-obvious ones into convex form, and understanding the landscape of special cases (LP, QP, SOCP, SDP) that form a hierarchy of increasing expressive power.

2.1 Introduction — Mathematical & Convex Optimization Problems

The General Mathematical Program

Before we can define convex programs, we need the broadest template. A mathematical optimization problem is simply the task of finding an input $x \in \mathbb{R}^n$ that makes some objective function as small as possible while respecting constraints.

Definition 2.1.1 — Mathematical Optimization Problem

Given functions $f : \mathbb{R}^n \to \mathbb{R}$ (the objective) and $g_i : \mathbb{R}^n \to \mathbb{R}$ for $i \in [m]$ (the constraints), a mathematical optimization problem is:

$$\min\; f(x) \quad \text{subject to} \quad g_i(x) \le 0,\; i \in [m],\quad x \in \mathbb{R}^n.$$
Intuition

Think of this as a recipe: you have a quantity you want to minimize (cost, error, risk) and a set of rules you cannot violate (budget limits, physics, capacity). Linear programming is a special case where both the objective and constraints are affine (i.e., $f(x) = c^\top x$ and $g_i(x) = a_i^\top x - b_i$).
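As a concrete instance, here is a tiny LP with made-up data, solved with SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP (hypothetical data): minimize c^T x subject to A x <= b, x >= 0.
c = np.array([1.0, 2.0])          # objective: x1 + 2*x2
A = np.array([[-1.0, -1.0]])      # encodes x1 + x2 >= 1 as -x1 - x2 <= -1
b = np.array([-1.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x, res.fun)  # -> [1. 0.] 1.0, all weight on the cheaper coordinate
```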

The Convex Optimization Problem

A convex program adds one powerful requirement: both the objective and all constraint functions must be convex. This seemingly simple restriction has profound algorithmic consequences.

Definition 2.1.2 — Convex Optimization Problem

Let $f, g_1, \dots, g_m : \mathbb{R}^n \to \mathbb{R}$ be convex functions. The convex optimization problem in general form is:

$$\min\; f(x) \quad \text{subject to} \quad g_i(x) \le 0,\; i \in [m].$$

The feasibility region is an intersection of convex sets — hence convex. Equality constraints of the form $h_j(x) = 0$ are allowed only if $h_j$ is affine (since an affine function is both convex and concave).

Why does convexity matter?

Convexity guarantees that the landscape has no "traps": there are no valleys that look optimal locally but are suboptimal globally. This is the key property that makes these problems efficiently solvable.

Theorem: Local = Global Optimality

This is arguably the most important theorem about convex programs. It says: if you've found a local minimum, you've actually found the global minimum.

Theorem 2.1.3 — Local Minima are Global Minima

Let $f : S \to \mathbb{R}$ be convex over a convex set $S$. Every local minimum of $f$ over $S$ is also a global minimum.

Proof Idea

Suppose $x$ is a local minimum but not a global one, so there exists some $y \in S$ with $f(y) < f(x)$. By convexity, every point on the segment from $x$ to $y$ satisfies $f(\lambda y + (1-\lambda)x) \le \lambda f(y) + (1-\lambda) f(x) < f(x)$ for $\lambda \in (0,1]$. Taking $\lambda \to 0^+$ yields such points arbitrarily close to $x$, contradicting the assumption that $x$ is a local minimum.

A natural follow-up is that the set of optimal solutions itself is convex (Theorem 2.1.4). If both $x$ and $y$ are optimal, then any convex combination $\lambda x + (1-\lambda)y$ is also optimal — you never "jump out" of the optimal set.

The Hierarchy of Special Cases

Convex programs encompass a rich hierarchy of problem classes, each more expressive than the last. Later chapters treat each in detail.

| Class | Objective | Constraints |
|---|---|---|
| LP (Linear Program) | affine | affine |
| QP (Quadratic Program) | convex quadratic | affine |
| QCQP (Quadratically Constrained QP) | convex quadratic | convex quadratic |
| SOCP (Second-Order Cone Program) | affine | $\lVert A_i x + b_i \rVert_2 \le c_i^\top x + d_i$ |
| SDP (Semidefinite Program) | affine | linear matrix inequalities |

Each class strictly contains the previous one: LP $\subset$ QP $\subset$ QCQP $\subset$ SOCP $\subset$ SDP $\subset$ general convex programs. This nesting is one of the organizing themes of the entire course.

Abstract Convex Optimization Problems

Sometimes a problem is convex (the feasible set is convex and we're minimizing a convex function), but the constraints aren't written with convex functions. This is called an abstract convex optimization problem. Making it into a true convex program requires rewriting the constraints.

Example 2.1.5 — Rewriting Constraints

Consider minimizing $x_1^2 + x_2^2$ subject to $x_1/(1+x_2^2)^2 \le 0$ and $(x_1+x_2)^2 = 0$. The constraint $(x_1+x_2)^2 = 0$ is not affine, and the first constraint is not convex. But note that $(x_1+x_2)^2 = 0$ is the same as $x_1 + x_2 = 0$ (which is affine), and $x_1/(1+x_2^2)^2 \le 0$ simplifies to $x_1 \le 0$ on the feasible set. The rewritten problem is a true convex program.
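To see the rewriting pay off numerically, here is a sketch that solves the rewritten convex program from the example with SciPy's SLSQP solver; the analytic optimum is the origin:

```python
import numpy as np
from scipy.optimize import minimize

# Rewritten convex program: min x1^2 + x2^2  s.t.  x1 <= 0,  x1 + x2 = 0.
cons = [
    {"type": "ineq", "fun": lambda x: -x[0]},        # x1 <= 0 (SLSQP wants fun >= 0)
    {"type": "eq",   "fun": lambda x: x[0] + x[1]},  # x1 + x2 = 0 (affine equality)
]
res = minimize(lambda x: x[0]**2 + x[1]**2, x0=[-1.0, 1.0],
               constraints=cons, method="SLSQP")
print(res.x)  # -> approximately [0, 0]
```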

Hidden Convex Problems & the Trust Region Subproblem

Even more surprising: some problems that don't look convex at all can be transformed into convex ones. These are called hidden convex problems. A key example is the Trust Region Subproblem (TRS): minimizing a (possibly non-convex) quadratic function over a ball.

Trust Region Subproblem

Given a symmetric matrix $A$ (not necessarily PSD), vector $b$, and scalar $c$:

$$\min\; x^\top A x + 2b^\top x + c \quad \text{s.t.}\quad \|x\|_2 \le 1.$$

Even though $A$ may have negative eigenvalues (making the objective non-convex), a clever variable substitution using the eigendecomposition $A = UDU^\top$ transforms this into a convex program.

How the transformation works (intuition)

Diagonalize $A$ to decouple the variables. Then use Theorem 2.1.8 (which shows that at optimality $e_i y^*_i \le 0$) to deduce the sign pattern of the optimal solution. This allows a substitution $y_i = -\text{sgn}(e_i)\sqrt{z_i}$ that turns the non-convex quadratic into a convex function of $z \ge 0$.
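The decoupling step can be checked numerically: with $y = U^\top x$ and $e = U^\top b$, the quadratic splits into per-coordinate terms, and since $U$ is orthogonal the norm constraint is unchanged. A small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2            # symmetric, possibly indefinite
b = rng.standard_normal(n)

d, U = np.linalg.eigh(A)     # A = U diag(d) U^T
x = rng.standard_normal(n)
y = U.T @ x                  # change of variables; U orthogonal, so ||y|| = ||x||
e = U.T @ b

lhs = x @ A @ x + 2 * b @ x
rhs = np.sum(d * y**2 + 2 * e * y)   # decoupled: one term per coordinate
print(np.isclose(lhs, rhs), np.isclose(np.linalg.norm(x), np.linalg.norm(y)))
```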

Equality Constraints with Convex Functions

Normally, equality constraints must be affine for the problem to be convex. However, Proposition 2.1.9 shows a special case where you can replace a convex equality $h(x) = 0$ with the inequality $h(x) \le 0$, provided the optimal solution naturally satisfies $h(x^*) = 0$. The two programs then have the same optimal value.

Algorithms — A Brief Note

Convex programs can be solved to accuracy $\varepsilon$ in time polynomial in the encoding length and $\log(1/\varepsilon)$ using the interior point method, as long as the functions are twice continuously differentiable and the problem is strictly feasible (Slater's condition: there exists an $x$ with $g_i(x) < 0$ for all $i$). The idea is to encode the feasible set using barrier functions and reduce the problem to a sequence of unconstrained quadratic problems via Newton's method.
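As an illustration of the barrier idea, here is a minimal log-barrier sketch for the hypothetical 1D problem $\min x^2$ s.t. $x \ge 1$ (a toy example, not the general-purpose method):

```python
# Barrier subproblem: minimize F_t(x) = t*x^2 - log(x - 1) by Newton's method.
# As t grows, the barrier minimizer approaches the constrained optimum x* = 1.
def newton_barrier(t, x=2.0, iters=50):
    for _ in range(iters):
        grad = 2*t*x - 1.0/(x - 1.0)
        hess = 2*t + 1.0/(x - 1.0)**2
        step = grad / hess
        while x - step <= 1.0:   # damp the step to stay strictly feasible (x > 1)
            step /= 2.0
        x -= step
    return x

x = 2.0                          # strictly feasible starting point
for t in [1, 10, 100, 1000, 10000]:   # warm-start across increasing t
    x = newton_barrier(t, x)
print(x)  # approaches the true optimum x* = 1
```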

2.2 Conic Optimization Problems

Conic programs are a particularly clean and general way to express convex optimization problems. The idea is elegant: instead of writing constraints as $g_i(x) \le 0$ with various convex functions, we require $x$ to belong to a proper cone $K$ (a cone that is convex, closed, solid, and pointed) and optimize a linear objective over the intersection of that cone with an affine subspace.

Definition 2.2.1 — Conic Program

Let $K \subseteq \mathbb{R}^n$ be a proper cone. A conic program has the following forms:

$$\textbf{General: } \min\; c^\top x \;\;\text{s.t.}\;\; -Fx - g \in K \qquad\qquad \textbf{Standard: } \min\; c^\top x \;\;\text{s.t.}\;\; Ax = b,\; x \in K.$$
Geometric Picture (Standard Form)

The feasible set is the intersection of the cone $K$ with the affine subspace $\{x : Ax = b\}$. For an LP, $K$ is the nonnegative orthant $\mathbb{R}^n_+$. For an SOCP, $K$ is a product of second-order (Lorentz) cones. For an SDP, $K$ is the positive semidefinite cone $\mathcal{S}^n_+$. Same template, different cones — that's the power of conic programming.

Generalized Conic Programs on Abstract Vector Spaces

The conic framework extends beyond $\mathbb{R}^n$. Given two real inner product vector spaces $S, T$, a linear map $\mathcal{A}: S \to T$, and a proper cone $K \subseteq S$, we can define a conic program and its dual in full generality. The dual uses the dual cone $K^*$ and the adjoint map $\mathcal{A}^*$.

Definition 2.2.5 — Conic Program Duality (Standard Form)
$$\textbf{Primal: } \min\; \langle c, x \rangle_S \;\;\text{s.t.}\;\; \mathcal{A}x = b,\; x \in K \qquad\qquad \textbf{Dual: } \max\; \langle y, b \rangle_T \;\;\text{s.t.}\;\; c - \mathcal{A}^* y \in K^*$$
Theorem 2.2.6 — Weak Duality for Conic Programs

If $x$ is primal feasible and $y$ is dual feasible, then $\langle c, x \rangle_S \ge \langle y, b \rangle_T$. The proof is short and elegant: use primal feasibility ($\mathcal{A}x = b$), the definition of the adjoint, and the fact that $c - \mathcal{A}^* y \in K^*$ and $x \in K$ means their inner product is nonnegative (by definition of dual cone).
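Written out, the chain of (in)equalities is:

$$\langle c, x \rangle_S - \langle y, b \rangle_T = \langle c, x \rangle_S - \langle y, \mathcal{A}x \rangle_T = \langle c - \mathcal{A}^* y,\, x \rangle_S \ge 0,$$

where the second equality uses the definition of the adjoint and the final inequality holds because $c - \mathcal{A}^* y \in K^*$ and $x \in K$.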

Important Caveat

Unlike linear programs, strong duality does not always hold for conic programs. Later chapters will show examples of SDPs where both primal and dual are feasible but have different optimal values. Additional conditions (like Slater's condition) are needed to guarantee zero duality gap.

2.3 Miscellaneous — Extensions and Boundaries of Convexity

2.3.1 Quasiconvex Programs

Not every useful optimization problem has a convex objective. A natural relaxation is quasiconvexity: instead of requiring the function value on any convex combination to be at most the weighted average of the endpoints (convexity), we only require it to be at most the maximum of the two endpoint values.

Definition 2.3.1 — Quasiconvex Function

A function $f : X \to \mathbb{R}$ on a convex set $X$ is quasiconvex if for all $x, y \in X$ and $\lambda \in [0,1]$:

$$f(\lambda x + (1-\lambda)y) \le \max\{f(x), f(y)\}.$$

Equivalently: all sublevel sets $\{x : f(x) \le \alpha\}$ are convex.

Intuition

A convex function's graph "curves upward." A quasiconvex function doesn't need to curve upward — it can be flat or even curve downward — as long as it doesn't create multiple separated "valleys." For instance, the ratio $p(x)/q(x)$ of a convex function $p \ge 0$ and a concave function $q > 0$ is quasiconvex.

The key algorithmic tool for quasiconvex programs is the bisection method. The insight: to check whether the optimal value is $\le t$, we solve the convex feasibility problem $\phi_t(x) \le 0,\; g_i(x) \le 0$, where $\phi_t$ is a convex function whose 0-sublevel set equals the $t$-sublevel set of $f$. By binary searching on $t$, we reduce a quasiconvex program to $O(\log(1/\varepsilon))$ convex feasibility problems.
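To make the bisection concrete, consider the hypothetical quasiconvex objective $f(x) = (x^2+1)/(x+2)$ on $x > -2$, a ratio of a nonnegative convex function and a positive concave one. Checking $f(x) \le t$ amounts to the convex feasibility question $\phi_t(x) = x^2 + 1 - t(x+2) \le 0$:

```python
import numpy as np

def feasible(t):
    # phi_t is a parabola minimized at x = t/2; feasible iff its minimum is <= 0.
    x_star = t / 2.0
    return x_star**2 + 1 - t * (x_star + 2) <= 0

lo, hi = 0.0, 10.0           # bracket known to contain the optimal value
for _ in range(60):          # O(log(1/eps)) convex feasibility checks
    mid = (lo + hi) / 2
    if feasible(mid):
        hi = mid
    else:
        lo = mid
print(hi)  # -> about 0.4721, i.e. -4 + 2*sqrt(5)
```

Each iteration halves the interval containing the optimal value, so 60 iterations pin it down to essentially machine precision.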

2.3.2 Maximizing a Convex Function

While minimizing a convex function over a convex set is tractable, maximizing one is in general hard. A convex function over a compact convex set achieves its maximum at an extreme point of the domain. If the domain is a polytope with a vertex representation $P = \text{conv}\{p_1, \dots, p_K\}$, you simply evaluate at all vertices. But if you only have a halfspace representation, the number of vertices can be exponential, making the problem intractable in general.
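A minimal sketch of the vertex-enumeration strategy on toy data:

```python
import numpy as np

# Maximizing a convex function over a polytope given in vertex form:
# evaluate at every listed point of the convex hull (toy example in the plane).
f = lambda x: np.sum(x**2)                             # convex objective
vertices = np.array([[0, 0], [2, 0], [0, 3], [1, 1]])  # P = conv of these points

values = [f(v) for v in vertices]
best = vertices[int(np.argmax(values))]
print(best, max(values))   # -> [0 3] 9, the maximum sits at a vertex
```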

2.3.3 Equality Constraint Elimination

Sometimes it is useful to remove equality constraints from a convex program. If we have $Ax = b$, we can parametrize the affine subspace as $x = x_0 + Nz$, where $x_0$ is a particular solution and $N$ spans the null space of $A$. Substituting into the objective and inequality constraints gives an equivalent problem with no equalities. In practice this isn't always efficient — it may destroy sparsity in $A$ — but it's a useful theoretical tool.
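A sketch of the parametrization using SciPy, with hypothetical $A$ and $b$:

```python
import numpy as np
from scipy.linalg import null_space

# Eliminating Ax = b: parametrize the affine subspace as x = x0 + N z.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

x0 = np.linalg.lstsq(A, b, rcond=None)[0]   # one particular solution of Ax = b
N = null_space(A)                           # columns span the null space of A

# Any z now yields a feasible x: the equality constraints have been eliminated.
z = np.array([0.7])
x = x0 + N @ z
print(np.allclose(A @ x, b))  # -> True
```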

2.4 Applications — Support Vector Machines

The main application discussed in this chapter is the Support Vector Machine (SVM), a classic supervised machine learning method for binary classification. The SVM story unfolds in several stages, each illustrating convex optimization ideas.

Stage 1: The Linear Classification Problem

Given $m$ points $\{x_1, \dots, x_m\} \subset \mathbb{R}^n$ with labels $y_i \in \{-1, +1\}$, find a hyperplane $w^\top x + b = 0$ that separates the two classes. This is a linear feasibility problem:

$$\min\; 0 \qquad \text{s.t.}\quad w^\top x_i + b \ge 1 \;\;\text{if } y_i = 1, \qquad w^\top x_i + b \le -1 \;\;\text{if } y_i = -1.$$

If the points are linearly separable, there are typically many feasible hyperplanes.

Stage 2: Maximizing the Margin

Among all separating hyperplanes, which one is "best"? The SVM selects the one with the maximum margin — the largest gap between the hyperplane and the nearest data points (called support vectors). This makes the classifier robust to small perturbations.

Theorem 2.4.3 — SVM Margin

For the pair $(w, b)$ normalized so that the support vectors satisfy $w^\top x + b = \pm 1$, the margin width is $2/\|w\|$. Therefore, maximizing the margin is equivalent to minimizing $\|w\|^2$, giving us the quadratic program:

$$\min\; \|w\|^2 \qquad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 \;\;\text{for all } i \in [m].$$
Why $2/\|w\|$?

The two supporting hyperplanes are $w^\top x + b = 1$ and $w^\top x + b = -1$. The distance between two parallel hyperplanes $w^\top x = c_1$ and $w^\top x = c_2$ is $|c_1 - c_2|/\|w\|$. Here $c_1 - c_2 = 2$, so the margin width is $2/\|w\|$.
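The max-margin QP above can be solved directly on a toy dataset; the sketch below uses SciPy's general-purpose SLSQP solver rather than a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin SVM as a small QP (toy data; a sketch, not production code).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):            # v = (w1, w2, b); minimize ||w||^2
    return v[0]**2 + v[1]**2

# One constraint y_i (w . x_i + b) >= 1 per training point.
cons = [{"type": "ineq", "fun": (lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1)}
        for i in range(len(y))]

res = minimize(objective, x0=np.array([1.0, 1.0, -2.0]),
               constraints=cons, method="SLSQP")
w, b = res.x[:2], res.x[2]
margin = 2.0 / np.linalg.norm(w)
print(w, b, margin)
```

For this data the analytic optimum is $w = (1/2, 1/2)$, $b = -1$, so the margin width is $2/\|w\| = 2\sqrt{2}$; the support vectors are $(2,2)$ and $(0,0)$.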

Stage 3: Non-linear Classification via Feature Maps

Real data is often not linearly separable. The trick is to apply a feature map $\Phi : \mathbb{R}^n \to \mathbb{R}^{n'}$ that lifts the data into a higher-dimensional space where it becomes linearly separable, then run SVM there.

Example — 1D Non-linear Separation

Consider points on the real line: $\{-2, -1, 1, 2\}$ with labels $\{-1, +1, +1, -1\}$. No line separates them. But mapping $x \mapsto (x, x^2)$ lifts them into 2D, where they become linearly separable — points with small $|x|$ land lower, and points with large $|x|$ land higher.
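The lift can be verified in a few lines; the separating hyperplane $x^2 = 2.5$ below is one hypothetical choice among many:

```python
import numpy as np

# Lifting 1D points with x -> (x, x^2) makes the example linearly separable.
x = np.array([-2.0, -1.0, 1.0, 2.0])
labels = np.array([-1, 1, 1, -1])

Z = np.column_stack([x, x**2])     # lifted 2D points
w, b = np.array([0.0, -1.0]), 2.5  # hyperplane x^2 = 2.5; predict sign(w . z + b)
predictions = np.sign(Z @ w + b)
print(predictions)  # -> [-1.  1.  1. -1.], matching the labels
```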

The challenge is that $\Phi$ might map into a very high-dimensional (even infinite-dimensional) space. The kernel trick, developed in Chapter 3 via duality, addresses this: the dual SVM formulation depends only on inner products $\Phi(x_i)^\top \Phi(x_j)$, which can often be computed directly via a kernel function $k(x, y) = \Phi(x)^\top \Phi(y)$ without ever computing $\Phi$ explicitly.

2.6 Supplement — Polynomial Optimization

This supplementary section gives a taste of problems beyond convex optimization. Polynomial optimization problems allow the objective and constraints to be polynomials — which are generally non-convex.

Definition 2.6.1 — Basic Closed Semialgebraic Set

For a finite set of polynomials $\mathcal{G} \subseteq \mathbb{R}[x]$, the set $\mathcal{G}_+ = \{x \in \mathbb{R}^n : g(x) \ge 0,\; \forall g \in \mathcal{G}\}$ is a basic closed semialgebraic set. These sets are not necessarily convex, and their description is not unique — adding a globally nonnegative polynomial to $\mathcal{G}$ doesn't change $\mathcal{G}_+$.

NP-hardness

Polynomial optimization is NP-hard in general. This can be shown by encoding the NP-hard Maximum Independent Set problem: for each vertex introduce a variable $x_i \in \{0, 1\}$, enforce the binary constraint via $x_i^2 - x_i = 0$ (a polynomial equality), add the edge constraints $x_i x_j = 0$ for every edge $(i, j)$, and maximize $\sum_i x_i$. This is a polynomial program.
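A brute-force illustration of this encoding on a triangle graph (it illustrates the constraints, not an efficient algorithm):

```python
import itertools

# Maximum Independent Set as a polynomial program (triangle graph, brute-forced).
edges = [(0, 1), (1, 2), (0, 2)]
n = 3

best = 0.0
for x in itertools.product([0.0, 1.0], repeat=n):
    binary_ok = all(xi**2 - xi == 0 for xi in x)       # x_i^2 - x_i = 0
    edge_ok = all(x[i] * x[j] == 0 for i, j in edges)  # no edge inside the set
    if binary_ok and edge_ok:
        best = max(best, sum(x))
print(best)  # -> 1.0: a triangle's largest independent set has one vertex
```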

The role of convex optimization in this setting is to provide convex relaxations — tractable approximations to NP-hard polynomial problems. These relaxation hierarchies (such as the SOS/Lasserre hierarchy, studied later) provide increasingly tight lower bounds on the true optimal value.

The Big Picture — Where Chapter 2 Fits

(Figure: nested problem classes, from outermost to innermost: general mathematical programs, polynomial optimization (NP-hard), convex programs, SDP ⊃ SOCP ⊃ QCQP ⊃ QP, and LP. Outer classes are harder and in general intractable; inner classes are more restrictive and polynomial-time solvable.)
Figure 1: The hierarchy of optimization problem classes. Chapter 2 focuses on the highlighted level — convex programs — the boundary between tractability and intractability.

Key Takeaways