3. Convex Analysis

3.1 Prove convex set:

  1. $x,y\in C$ and $\lambda\in[0,1]$, then prove $\lambda x + (1-\lambda) y \in C$
  2. the intersection of convex sets is convex
  3. $f$ is convex $\iff$ its epigraph $\{(x,\alpha) \mid f(x)\le \alpha\}$ is a convex set

3.2 Prove convex function:

  1. Prove $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$
  2. a nonnegative linear combination of convex functions is convex
  3. $h$ convex, $g$ non-decreasing and convex, then $g\circ h$ is convex
  4. the epigraph $\{(x,\alpha) \mid f(x)\le \alpha\}$ is convex $\iff$ $f$ is convex
  5. Jensen's inequality: if $f$ is convex and $x=\sum_{j=1}^k \lambda_j x^{j}$ with $\lambda_j\ge 0$, $\sum_{j=1}^k \lambda_j =1$, then $f(x)\le \sum_{j=1}^k \lambda_j f(x^j)$
  6. Tangent plane characterization: convex $\iff$ $f(y) \ge f(x)+\nabla f(x)^T (y-x)$
  7. Monotone gradient condition: convex $\iff$ $\langle \nabla f(\mathbb y)-\nabla f(\mathbb x), \mathbb y - \mathbb x \rangle \ge 0, \quad \forall \mathbb x, \mathbb y \in S$
  8. convex $\iff$ Hessian is positive semidefinite (for twice-differentiable $f$)

For constrained convex programming $\min\{f(x) \mid x\in C\}$, $x^*$ is a global minimizer $\iff$ $\langle \nabla f(x^*), x-x^* \rangle \ge 0, \ \forall x \in C$.
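
A minimal numerical sketch of item 8 above (assuming NumPy; the quadratic $f(x)=\tfrac12 x^TQx$ is my own toy example, not one from the notes): the Hessian of $f$ is the constant matrix $Q$, so convexity reduces to checking that $Q$ has no negative eigenvalues.

```python
import numpy as np

# Toy example: f(x) = 0.5 * x^T Q x has constant Hessian Q.
Q = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # symmetric test matrix (assumed example)

eigvals = np.linalg.eigvalsh(Q)         # eigenvalues of the Hessian
print(eigvals)                          # approx [1.38, 3.62]
print(np.all(eigvals >= -1e-12))        # True  =>  Hessian PSD  =>  f is convex
```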

3.3 Projection: $\Pi_C(z) = \operatorname{argmin} \{ \frac{1}{2} \|x-z\|^2 \mid x\in C \}$

  • $x^* = \Pi_C(z)$ $\iff$ $\langle z-x^*, x-x^* \rangle \leq 0, \forall x\in C$
  • $\| \Pi_C(z) - \Pi_C(w) \| \leq \|z-w \|$.
  • If $C$ is a linear subspace, then $z-\Pi_C(z)\perp C$, i.e. $z = \Pi_C(z) + (z-\Pi_C(z)) \quad \text{and} \quad \langle z-\Pi_C(z), \Pi_C(z)\rangle = 0$
  • If $C$ is a closed convex cone, $\langle z-\Pi_C(z), \Pi_C(z)\rangle = 0$
  • If $\theta (x) = \frac{1}{2} \| x-\Pi_C(x)\|^2$, then $\theta(\cdot)$ is a continuously differentiable convex function and $\nabla \theta(x) = x-\Pi_C(x)$
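
A minimal sketch of the first two bullets (assuming NumPy; the Euclidean unit ball is my own choice of closed convex set $C$): it checks the variational inequality and the nonexpansiveness of $\Pi_C$ on random points.

```python
import numpy as np

def proj_ball(z, r=1.0):
    """Projection onto the Euclidean ball of radius r (a closed convex set)."""
    nz = np.linalg.norm(z)
    return z if nz <= r else (r / nz) * z

rng = np.random.default_rng(0)
z, w = rng.normal(size=3), rng.normal(size=3)
pz, pw = proj_ball(z), proj_ball(w)

# Nonexpansiveness: ||Pi_C(z) - Pi_C(w)|| <= ||z - w||
assert np.linalg.norm(pz - pw) <= np.linalg.norm(z - w) + 1e-12

# Variational inequality: <z - Pi_C(z), x - Pi_C(z)> <= 0 for any x in C
x = proj_ball(rng.normal(size=3))       # an arbitrary point of C
assert np.dot(z - pz, x - pz) <= 1e-12
```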

3.4 Cones:

  • Dual cone: $S^* = \{ y\in \mathcal E \mid \langle x,y \rangle \ge 0, \forall x\in S \}$. A cone is self-dual if $C^*=C$; examples include $\mathbb S_+^n$ and $\mathbb R^n_+$.
  • Polar cone: $S^\circ = -S^*$
  • $(C^*)^*=C$ if $C$ is a closed convex cone.
  • Spectral cone $\mathcal K^*$, Nuclear-norm cone $\mathcal C$, $\mathrm {COP}_n$, $\mathrm {CPP}_n$, double nonnegative cone $\mathbb D_n$. [Chapter 3, P79]
  • $z-\Pi_C(z) = \Pi_{C^\circ}(z)$, if $C$ is closed convex cone.
  • Normal cone:
    • $N_C(\bar x) = \{z\in \mathcal E \mid \langle z, x-\bar x \rangle \leq 0, \forall x\in C\}$ where $\bar x \in C$.
    • $u\in N_C(y) \iff y=\Pi_C(y+u)$
    • $N_{C_1} (\bar x) + N_{C_2} (\bar x) \subset N_{C_1\cap C_2} (\bar x)$ where $\bar x \in C_1 \cap C_2$.
    • $C=\mathbb S^n_+$. $X=U_1D_1U_1^T$, $D_1 = \mathrm{diag}(d_1, \cdots, d_r)$ with the eigenvalues in descending order and $[U_1\ U_2]$ orthogonal. Then $N_C(X)=\{-U_2\Delta U_2^T \mid \Delta \in \mathbb S_+^{n-r}\}$. [P81]
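
A minimal sketch of the decomposition $z-\Pi_C(z)=\Pi_{C^\circ}(z)$ and the normal-cone characterization $u\in N_C(y)\iff y=\Pi_C(y+u)$ (assuming NumPy; I use the nonnegative orthant $C=\mathbb R^n_+$, whose polar cone is the nonpositive orthant, so both projections are coordinatewise):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=5)

proj_C     = np.maximum(z, 0.0)   # Pi_C(z) for C = R^n_+
proj_polar = np.minimum(z, 0.0)   # Pi_{C°}(z), C° = nonpositive orthant

# Identity from above: z - Pi_C(z) = Pi_{C°}(z)
assert np.allclose(z - proj_C, proj_polar)

# Normal-cone characterization: u in N_C(y)  <=>  y = Pi_C(y + u)
y = np.array([0.0, 2.0, 0.0])
u = np.array([-1.0, 0.0, -3.0])   # u <= 0 and u_i = 0 where y_i > 0, so u in N_C(y)
assert np.allclose(np.maximum(y + u, 0.0), y)
```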

3.5 Subgradient:

  • $\langle v-u, y-x \rangle \ge 0, \quad \forall u\in \partial f(x), v\in \partial f(y)$ (monotone mapping)

  • $\partial \delta_C(x) = \begin{cases}
    \emptyset \quad & \text{if }x\notin C \\
    \mathcal N_C(x)= \text{normal cone of $C$ at $x$} \quad & \text{if }x\in C
    \end{cases}$ where $C$ is a convex subset of $\mathbb R^n$

  • $\partial f(0) = \{ y\in \mathbb R^n \mid \|y\|_\infty \le 1 \}$, if $f(x) = \| x \|_1$ [P85]
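
A minimal sketch (assuming NumPy; the points are my own toy data): for $f(x)=\|x\|_1$, any $g$ with $g_i=\mathrm{sgn}(x_i)$ where $x_i\neq 0$ and $|g_i|\le 1$ where $x_i=0$ lies in $\partial f(x)$, so the subgradient inequality $f(y)\ge f(x)+\langle g, y-x\rangle$ must hold for every $y$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([1.5, 0.0, -2.0])
g = np.array([1.0, 0.3, -1.0])            # an element of the subdifferential at x

# Subgradient inequality for f = l1-norm, checked at random points y
for _ in range(100):
    y = rng.normal(size=3)
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x) - 1e-12
```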

3.6 Fenchel Conjugate:

$$
f^*(y) = \sup \{ \langle y,x \rangle - f(x) \mid x\in \mathcal E \}, \quad y\in \mathcal E
$$

  • $f^*$ is always closed and convex, even if $f$ is non-convex or not closed.

  • Equivalent conditions (for closed, proper, convex $f$)

    1. $f(x) + f^*(x^*) = \langle x, x^* \rangle$
    2. $x^* \in \part f(x)$
    3. $x \in \part f^*(x^*)$
    4. $\langle x, x^* \rangle - f(x) =\max_{z\in \mathcal E} \{\langle z, x^* \rangle - f(z)\}$
    5. $\langle x, x^* \rangle - f^*(x^*) =\max_{z^*\in \mathcal E} \{\langle x, z^* \rangle - f^*(z^*)\}$
    6. $(f^*)^* = f$
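
A minimal sketch of the Fenchel-Young inequality and its equality case (assuming NumPy; the choice $f(x)=\tfrac12\|x\|^2$, for which $f^*(y)=\tfrac12\|y\|^2$, is my own toy example):

```python
import numpy as np

rng = np.random.default_rng(3)
f  = lambda x: 0.5 * np.dot(x, x)
fs = lambda y: 0.5 * np.dot(y, y)              # Fenchel conjugate of f

x = rng.normal(size=4)
y = rng.normal(size=4)
assert f(x) + fs(y) >= np.dot(x, y) - 1e-12    # Fenchel-Young inequality
assert np.isclose(f(x) + fs(x), np.dot(x, x))  # equality when x* = grad f(x) = x
```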

3.7 Moreau-Yosida regularization and proximal mapping

$$
\begin{array}{ll}
{\text { MY regularization of } f \text { at } x:} & {M_{f}(x)=\min_{y \in \mathcal{E}}\left\{\phi(y ; x):=f(y)+\frac{1}{2}\|y-x\|^{2}\right\}} \\
{\text { Proximal mapping of } f \text { at } x:} & {P_{f}(x)=\operatorname{argmin}_{y \in \mathcal{E}}\left\{\phi(y ; x):=f(y)+\frac{1}{2}\|y-x\|^{2}\right\}}
\end{array}
$$

  • $x=P_f(x)+P_{f^*}(x), \forall x\in \mathcal E$
  • $f(x)=\lambda \|x\|_1$, $f^*(z)=\delta_C(z)$ where $C=\partial f(0)=\{z\in \mathbb R^n \mid \|z\|_\infty \le \lambda \}$ [P91]
  • If $f=\delta_C$, then $P_{\delta_{C}}(x)=\operatorname{argmin}_{y \in \mathcal{E}}\left\{\delta_{C}(y)+\frac{1}{2}\|y-x\|^{2}\right\}=\operatorname{argmin}_{y \in C} \frac{1}{2}\|y-x\|^{2}=\Pi_{C}(x)$. [P92]
  • If $C=\mathbb S_+^n$, then $\Pi_C(x)=Q\,\mathrm{diag}(d_+)\,Q^T$, using the spectral decomposition $x=Q\,\mathrm{diag}(d)\,Q^T$
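
A minimal sketch (assuming NumPy) combining the bullets above for $f(x)=\lambda\|x\|_1$: the prox is soft-thresholding, the prox of $f^*=\delta_C$ is the projection onto the $\ell_\infty$-ball of radius $\lambda$, and the two add up to $x$ by the Moreau identity.

```python
import numpy as np

lam = 0.8

def prox_l1(x, lam):
    """Prox of f(x) = lam*||x||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def proj_linf_ball(x, lam):
    """Prox of f*(z) = delta_C(z) with C = {||z||_inf <= lam}: clipping."""
    return np.clip(x, -lam, lam)

x = np.array([1.7, -0.3, 0.0, -2.4])
# Moreau identity from above: x = P_f(x) + P_{f*}(x)
assert np.allclose(x, prox_l1(x, lam) + proj_linf_ball(x, lam))
```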

4. Gradient Methods

Descent direction: a direction $\mathbf d$ such that $\langle \nabla f(x^*), \mathbf d\rangle < 0$.

Steepest descent direction: $\mathbf d^* = - \nabla f(x^*)$.

  1. Steepest descent method with exact line search: + Convergence rate

  2. Accelerated Proximal Gradient method: + Error, complexity

    1. Sparse regression [P112]
    2. Projection onto the convex cone $\mathrm{DNN}_n^* = \mathbb S_+^n + \mathcal N^n$ [P112]
  3. Gradient Projection Method (see the sketch after this list):

    $x^{k+1} = P_Q(x^k - \alpha_k \nabla f(x^k))$, where $x(\alpha) = x-\alpha \nabla f(x)$ and $x_Q(\alpha) = P_Q(x(\alpha))$

  4. Stochastic Gradient Descent Method
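
A minimal sketch of the gradient projection iteration from item 3 (assuming NumPy; the box-constrained quadratic is my own toy problem, not one from the slides):

```python
import numpy as np

# Toy problem: minimize ||x - c||^2 over the box Q = [0, 1]^n.
c = np.array([1.5, -0.7, 0.3])
grad  = lambda x: 2.0 * (x - c)
projQ = lambda x: np.clip(x, 0.0, 1.0)

x, alpha = np.zeros(3), 0.25               # constant step size < 2/L with L = 2
for _ in range(100):
    x = projQ(x - alpha * grad(x))         # x^{k+1} = P_Q(x^k - a*grad f(x^k))
print(x)                                   # converges to P_Q(c) = (1.0, 0.0, 0.3)
```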

5. Basic NLP (KKT)

(LICQ) Linear Independence Constraint Qualification (regular point): the gradients $\nabla g_i(x^*)$ of the equality constraints and $\nabla h_j(x^*)$ of the active inequality constraints are linearly independent.

KKT necessary condition:

  • 1st order:

    $\begin{array}{l} {[1] \quad \nabla f\left(\mathbf{x}^{*}\right)+\sum_{i=1}^{m} \lambda_{i}^{*} \nabla g_{i}\left(\mathbf{x}^{*}\right)+\sum_{j=1}^{p} \mu_{j}^{*} \nabla h_{j}\left(\mathbf{x}^{*}\right)=0} \\ {[2] \quad g_{i}\left(\mathbf{x}^{*}\right)=0, \quad \forall i=1,2, \cdots, m} \\ {[3] \quad \mu_{j}^{*} \geq 0,\ h_{j}\left(\mathbf{x}^{*}\right) \leq 0, \quad \mu_{j}^{*} h_{j}\left(\mathbf{x}^{*}\right)=0, \quad \forall j=1,2, \cdots, p} \end{array}$

  • 2nd order: $y^T H_L(x^*)\, y \ge 0$ for all $y\in T(x^*)$, the tangent space at $x^*$

    • Easier check: $Z(x^*)^T H_L(x^*) Z(x^*)$ is positive semidefinite, where the columns of $Z(x^*)$ form a basis of the tangent space

Steps:

  1. Check the regularity condition.
  2. Verify the KKT first-order necessary conditions.
  3. Verify the KKT second-order necessary condition:
    1. Hessian matrix
    2. basis of null space of $\mathcal D(\mathbb x^*)^T$
    3. definiteness check
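
A minimal sketch of these steps on an assumed toy problem, $\min\ x_1^2+x_2^2$ s.t. $x_1+x_2-1=0$ (my own example, not one from the notes):

```python
import numpy as np

x_star, lam = np.array([0.5, 0.5]), -1.0

grad_f = 2.0 * x_star                      # gradient of the objective at x*
grad_g = np.array([1.0, 1.0])              # gradient of the equality constraint

# Steps 1-2: regularity (grad_g != 0) and stationarity of the Lagrangian
assert np.allclose(grad_f + lam * grad_g, 0.0)

# Step 3: Hessian of the Lagrangian, basis Z of the null space of grad_g^T,
# then check that Z^T H_L Z is positive semidefinite.
H_L = 2.0 * np.eye(2)
Z   = np.array([[1.0], [-1.0]])            # null space of [1, 1]
reduced = Z.T @ H_L @ Z
assert np.all(np.linalg.eigvalsh(reduced) >= -1e-12)
```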

Projection onto simplex: [P139-140]
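
A common sorting-based $O(n\log n)$ algorithm for this projection is sketched below (assuming NumPy; this is the standard method and may differ in details from the derivation on P139-140):

```python
import numpy as np

def proj_simplex(v):
    """Project v onto the probability simplex {x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]                         # sort in descending order
    css = np.cumsum(u)
    # largest index rho with u_rho - (cumsum_rho - 1)/rho > 0 (1-based count)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)         # the relaxed multiplier
    return np.maximum(v - theta, 0.0)

x = proj_simplex(np.array([0.9, 0.8, -0.1]))
print(x, x.sum())                                # nonnegative entries summing to 1
```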

relax multipliers

KKT sufficient conditions: 1st + 2nd

Steps:

  1. Verify it is a KKT point.
  2. Verify it satisfies the second-order sufficient condition:
    1. Hessian matrix
    2. $Z(\mathbf x^*)$
    3. Determine definiteness

Important points:

  1. For a convex program, a KKT point is a global minimizer; however, without a constraint qualification, a global minimizer of a convex program may not be a KKT point.

  2. With a regularity condition, a global minimizer is a KKT point. For a convex programming problem with at least one inequality constraint, Slater's condition ensures that a global minimizer is a KKT point.

    Slater's condition: there exists $\hat{\mathbb x} \in \mathbb R^n$ such that $g_i(\hat{\mathbb x}) = 0, \ \forall i = 1,\cdots, m$ and $h_j(\hat{\mathbb x}) < 0, \ \forall j = 1, \cdots, p$.

     Informally, Slater's condition requires the feasible region to contain a point that is strictly feasible for the inequality constraints (an [interior point](https://en.wikipedia.org/wiki/Interior_(topology)) in this sense).
    
  3. Linear equality constrained problem:

    A point $x^* \in S$ is a KKT point $\iff$ $x^*$ is a global minimizer of $f$ on $S$ (for convex $f$).

    No regularity or Slater condition is needed.

  4. Linear + convex quadratic programming.

6. Lagrangian duality

Weak duality: the optimal value of the dual problem is no larger than the optimal value of the primal problem; the difference is the duality gap.

Saddle point optimality $\longleftrightarrow$ KKT

  1. For a fixed $\mathbb x^∗$, $(λ^∗, μ^∗)$ maximizes $L(\mathbb x^∗, λ, μ)$ over all $(λ, μ)$ with $μ ≥ 0$.
  2. For a fixed $(λ^∗, μ^∗)$ , $\mathbb{x}^*$ minimizes $L(\mathbb x, λ^∗, μ^∗)$ over all $x ∈ X$.
  3. A saddle point $(\mathbb x^*, \lambda^*,\mu^*)$ $\Rightarrow$ the KKT conditions hold, but the converse is not true in general.
  4. KKT point of a convex program is a saddle point.
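
A minimal sketch of weak duality and the saddle point on an assumed toy problem $\min\ \tfrac12 x^2$ s.t. $1-x\le 0$ (my own example; the Lagrangian is $L(x,\mu)=\tfrac12 x^2+\mu(1-x)$ and the dual function is $g(\mu)=\mu-\tfrac12\mu^2$):

```python
import numpy as np

primal = lambda x: 0.5 * x**2            # objective, feasible for x >= 1
dual   = lambda mu: mu - 0.5 * mu**2     # dual function g(mu) = min_x L(x, mu)

# Weak duality: g(mu) <= f(x) for every feasible x and every mu >= 0
for mu in np.linspace(0, 3, 7):
    for x in np.linspace(1, 3, 7):
        assert dual(mu) <= primal(x) + 1e-12

# Strong duality holds here: (x*, mu*) = (1, 1) is a saddle point and a KKT point
print(primal(1.0), dual(1.0))            # both equal 0.5
```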

7. Nonlinear conic programming

Sparse regression problem:

  • objective function is convex

  • APG: [P112]

  • KKT: [P153-154]

  1. Chapter 7: conic programming

  2. Chapter 8: sPADMM [P214]

Indicator function:

  • $\delta_C(x) = \begin{cases} 0, & x\in C \\ +\infty, & x\notin C \end{cases}$

$$
\delta_{S_+^n}^* = \delta_{-S_+^n}
$$

$$
\mathrm{Prox}_{\lambda f}(x) = (I+\lambda\, \partial f)^{-1}(x)
$$

$P_f(x) = x-P_{f^*}(x)$

Proximal mapping $P_f$ vs. projection $\Pi_C(x)$: when $f=\delta_C$ is the indicator function of $C$, the two operators coincide.

$f(x) = \lambda \|x\|_1$, then $f^*(z) = \delta_C(z)$ where $C=\partial f(0) = \{z\in\mathbb R^n \mid \|z\|_\infty \le \lambda \}$; the proximal mapping is $P_f(x) = \mathrm{sgn}(x) \circ \max\{|x|-\lambda, 0\}$ [P91-92]

If $f=\delta_C$ with $C$ closed and convex, the proximal and projection operators coincide, because $P_{\delta_{C}}(x)=\operatorname{argmin}_{y \in \mathcal{E}}\left\{\delta_{C}(y)+\frac{1}{2}\|y-x\|^{2}\right\}=\operatorname{argmin}_{y \in C} \frac{1}{2}\|y-x\|^{2}=\Pi_{C}(x)$. When $C=\mathbb S_+^n$, $\Pi_C(x)=Q \,\mathrm{diag}(d_+)\, Q^T$ using the spectral decomposition $x=Q\,\mathrm{diag}(d)\, Q^T$.

If $f(X)=\|X\|_*$ (nuclear norm) with SVD $X = U\Sigma V^T$, then $P_f(X) = U\,\mathrm{diag}\big((\sigma_i-1)_+\big)\,V^T$, i.e. soft-thresholding of the singular values.

A point satisfies $x^* = P_f(x^*)$ $\iff$ $x^*$ minimizes $f$; the fixed-point iteration $x^{k+1} = P_f(x^k)$ converges to such an optimum $x^*$.

Dual norm: [P88]

  • $f(x)=\lambda \|x\|_1$, then $f^*(x) = \delta_{B_\lambda}$, where $B_\lambda = \{x\in \mathbb R^n \mid \|x\|_\infty\leq \lambda\}$.
  • $f(x)=\lambda \|x\|_\infty$, then $f^*(x) = \delta_{B_\lambda}$, where $B_\lambda = \{x\in \mathbb R^n \mid \|x\|_1\leq \lambda\}$.
  • $f(x)=\lambda \|x\|_m$, then $f^*(x) = \delta_{B_\lambda}$, where $B_\lambda = \{x\in \mathbb R^n \mid \|x\|_n\leq \lambda\}$ with $\frac{1}{m}+\frac{1}{n}=1$; $\|\cdot\|_n$ is called the dual norm of $\|\cdot\|_m$.
  • $\ell_2$ norm is self-dual norm.

$y\in N_C(x)$ $\iff$ $x = \Pi_C(x+y)$

$x=P_f(x)+P_{f^*}(x), \forall x\in \mathcal E$

$G = \Pi_{\mathbb S_+^n}(G) - \Pi_{\mathbb S_+^n}(-G)$

$\min\{ \frac{1}{2}\|G+Z\|^2 \mid Z\in \mathbb S_+^n\} = \frac{1}{2} \| G+\Pi_{\mathbb S^n_+}(-G)\|^2 = \frac{1}{2} \|\Pi_{\mathbb S_+^n}(G) \|^2$
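
A minimal sketch (assuming NumPy) of the two identities above: the projection onto $\mathbb S_+^n$ keeps the nonnegative part of the spectrum, the Moreau decomposition reconstructs $G$, and the optimal value of the minimization equals $\tfrac12\|\Pi_{\mathbb S_+^n}(G)\|^2$.

```python
import numpy as np

def proj_psd(G):
    """Projection onto S_+^n: zero out the negative eigenvalues."""
    d, Q = np.linalg.eigh(G)
    return Q @ np.diag(np.maximum(d, 0.0)) @ Q.T

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
G = 0.5 * (A + A.T)                              # a symmetric test matrix

# Moreau decomposition on S_+^n: G = Pi_{S_+^n}(G) - Pi_{S_+^n}(-G)
assert np.allclose(G, proj_psd(G) - proj_psd(-G))

# Optimal value of min {0.5*||G+Z||^2 : Z in S_+^n}
val = 0.5 * np.linalg.norm(proj_psd(G), 'fro')**2
print(val)
```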

  1. $\mathbb x \in S \iff \mathbb x = \mathbb x^* + \mathbb z$ for some $\mathbb z \in \operatorname{Null}(\mathbf A)$. [space of $\mathbb x$]
  2. $\mathbb y^T \mathbb z = 0$, $\forall \mathbb z \in \operatorname{Null}(\mathbf A) \iff \exists\, \lambda^* \in \mathbb R^m$ s.t. $\mathbb y+ \mathbf A^T \lambda^*=0$.

The linear map $\mathcal A(X) = \begin{bmatrix} \langle A_1, X \rangle \\ \vdots \\ \langle A_m, X \rangle \end{bmatrix}$. Let $\mathcal A^*$ be the adjoint of $\mathcal A$, defined by $\mathcal A^* y = \sum\limits_{k=1}^m y_kA_k$, $y\in \mathbb R^m$.
$$
\langle \mathcal A^* y, X \rangle = \langle y, \mathcal A X \rangle
$$
if $\mathcal A(X) = \operatorname{diag} (X)$, then $A_k = E_{kk}$, and $\mathcal A^* y = \operatorname{diag} (y)$.
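
A minimal sketch (assuming NumPy) checking the adjoint identity $\langle \mathcal A^*y, X\rangle = \langle y, \mathcal A X\rangle$ for $\mathcal A(X)=\operatorname{diag}(X)$:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
X = rng.normal(size=(n, n)); X = 0.5 * (X + X.T)   # symmetric test matrix
y = rng.normal(size=n)

AX      = np.diag(X)                  # A(X): the vector of diagonal entries
Astar_y = np.diag(y)                  # A*(y): the diagonal matrix diag(y)

# <A* y, X> = <y, A(X)>
assert np.isclose(np.sum(Astar_y * X), np.dot(y, AX))
```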

Dual problems:

  • Linear programs + dual problem:

  • Quadratic program + dual problem

  • Square function + Linear inequality constraints [P177]

  • NLP + dual problem [P178,180]

  • SDP + Dual Problem [P178, P180]

  • SDPLS + dual problem [P179, ]

  • Quadratic Semidefinite problem + Dual problem

  • Convex quadratic composite optimization problem + dual problem:
