4. Gradient Method

4.1 Unconstrained Optimization

Descent direction: a direction $\mathbf d$ at a point $\mathbf x$ such that $\langle \nabla f(\mathbf x), \mathbf d\rangle <0$.

Steepest descent direction: the direction $\mathbf d^* = - \nabla f(\mathbf x)$, i.e. the negative gradient.

Steepest descent method with exact line search:

[step 0] Choose an initial point $\mathbf x^{(0)} \in \mathbb R^n$ and set $k=0$.

[step k] Set $\mathbf d^{(k)} = -\nabla f(\mathbf x^{(k)})$; if $\mathbf d^{(k)} = \mathbf 0$, stop. Otherwise compute the exact step size $\alpha_k = \operatorname{argmin}_{\alpha \ge 0} f(\mathbf x^{(k)} + \alpha \mathbf d^{(k)})$, set $\mathbf x^{(k+1)} = \mathbf x^{(k)} + \alpha_k \mathbf d^{(k)}$, and repeat with $k \leftarrow k+1$.

The steepest descent method follows a zig-zag path towards $\mathbf x^*$, making a right-angle turn at each iteration $\longrightarrow$ slow convergence.
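Below is a minimal Python sketch of the method applied to the quadratic $q(\mathbf x)=\frac{1}{2}\mathbf x^T\mathbf Q\mathbf x$ studied in Section 4.1.1; for this quadratic the exact step size has the closed form $\alpha_k = \langle \mathbf g^k,\mathbf g^k\rangle/\langle \mathbf g^k,\mathbf Q\mathbf g^k\rangle$ with $\mathbf g^k = \nabla q(\mathbf x^k)$. The matrix and starting point are illustrative.

```python
import numpy as np

def steepest_descent_quadratic(Q, x0, num_iter=50):
    """Steepest descent with exact line search for q(x) = 0.5 * x^T Q x.
    For this quadratic the exact step size is alpha_k = <g, g> / <g, Q g>."""
    x = np.asarray(x0, dtype=float).copy()
    path = [x.copy()]
    for _ in range(num_iter):
        g = Q @ x                          # gradient of q at x
        if g @ g < 1e-16:                  # already (numerically) at the minimizer
            break
        alpha = (g @ g) / (g @ (Q @ g))    # exact line search along -g
        x = x - alpha * g
        path.append(x.copy())
    return x, path

# Illustrative 2-D example with condition number kappa(Q) = 10:
Q = np.diag([1.0, 10.0])
x, path = steepest_descent_quadratic(Q, [10.0, 1.0])
# The iterates zig-zag: (x^{k+1} - x^k) is orthogonal to (x^{k+2} - x^{k+1}),
# as stated in the theorem below.
```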

Theorem:

Steepest descent method with exact line search moves in perpendicular steps. More precisely, if $\mathbf x^{(k)}$ is a steepest descent sequence for a function $f(\mathbf x)$, then for each $k$, the vector joining $\mathbf x^{(k)}$ to $\mathbf x^{(k+1)}$ (i.e. $\mathbf x^{(k+1)} - \mathbf x^{(k)}$) is orthogonal (perpendicular) to the vector joining $\mathbf x^{(k+1)}$ to $\mathbf x^{(k+2)}$ (i.e. $\mathbf x^{(k+2)}-\mathbf x^{(k+1)}$).

Property:

  • monotonic decreasing property: $f(\mathbf x^{(k+1)}) \le f(\mathbf x^{(k)})$ for all $k$
  • convergence
    • the limit of any convergent subsequence of $\{\mathbf x^{(k)}\}$ is a critical point of $f(\mathbf x)$.

4.1.1 convergence rate for unconstrained quadratic minimization

$$
\min_{\mathbf x\in \mathbb R^n} q(\mathbf x)=\frac{1}{2}\mathbf x^T \mathbf Q \mathbf x
$$

Proposition:

Suppose $\mathbf Q$ is symmetric positive definite and the steepest descent method with exact line search is used. Then:

  1. Let $\mathbf d^k = \nabla q(\mathbf x^k)=\mathbf Q \mathbf x^k$. Then
    $$
    \frac{q\left(\mathbf{x}^{k+1}\right)}{q\left(\mathbf{x}^{k}\right)}= 1-\frac{\left\langle\mathbf{d}^{k}, \mathbf{d}^{k}\right\rangle^{2}}{\left\langle\mathbf{d}^{k}, \mathbf{Q}\mathbf{d}^{k}\right\rangle\left\langle\mathbf{d}^{k}, \mathbf{Q}^{-1} \mathbf{d}^{k}\right\rangle}
    $$

  2. $$
    \frac{q\left(\mathbf{x}^{k+1}\right)}{q\left(\mathbf{x}^{k}\right)} \leq\left[\frac{\kappa(\mathbf{Q})-1}{\kappa(\mathbf{Q})+1}\right]^{2}=: \rho(\mathbf{Q})
    $$

    where $\kappa(\mathbf Q)=\lambda_n / \lambda_1$, and $\lambda_n, \lambda_1$ are the largest and smallest eigenvalues of $\mathbf Q$, respectively.
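A short derivation of item 1 (using the exact step size $\alpha_k=\langle\mathbf d^k,\mathbf d^k\rangle/\langle\mathbf d^k,\mathbf Q\mathbf d^k\rangle$, the update $\mathbf x^{k+1}=\mathbf x^{k}-\alpha_k\mathbf d^{k}$, and $\mathbf x^{k}=\mathbf Q^{-1}\mathbf d^{k}$):
$$
q(\mathbf x^{k+1}) = q(\mathbf x^{k}) - \alpha_k\langle\mathbf d^{k},\mathbf d^{k}\rangle + \frac{\alpha_k^{2}}{2}\langle\mathbf d^{k},\mathbf Q\mathbf d^{k}\rangle
= q(\mathbf x^{k}) - \frac{\langle\mathbf d^{k},\mathbf d^{k}\rangle^{2}}{2\,\langle\mathbf d^{k},\mathbf Q\mathbf d^{k}\rangle},
\qquad
q(\mathbf x^{k}) = \frac{1}{2}\langle\mathbf d^{k},\mathbf Q^{-1}\mathbf d^{k}\rangle,
$$
and dividing the first identity by the second gives the ratio in item 1.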

Remark.

  • the convergence rate $\rho(\mathbf Q)$ depends on $\kappa(\mathbf Q)$. When $\kappa(\mathbf Q)$ is large, the convergence rate
    $$
    \rho(\mathbf Q) \approx 1-\frac{4}{\kappa(\mathbf Q)}
    $$

  • In $\mathbb R^2$, the contours of $q(\mathbf x)$ are ellipses, and the ratio between the lengths of their principal axes is $\sqrt{\kappa(\mathbf Q)}$. Thus the larger the value of $\kappa(\mathbf Q)$, the more elongated the ellipses are.

  • The number of iterations needed to reduce the relative error $q(\mathbf x^k)/q(\mathbf x^0)$ to smaller than $\epsilon$ is given by
    $$
    k=\left[ \frac{\log \epsilon}{\log \rho(\mathbf Q)} \right] + 1
    $$
    where $[a]$ denotes the largest integer less than or equal to $a$.
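As an illustrative calculation: for $\kappa(\mathbf Q)=100$ and $\epsilon = 10^{-6}$, we have $\rho(\mathbf Q)=(99/101)^{2}\approx 0.9608$, so
$$
k=\left[ \frac{\log 10^{-6}}{\log \rho(\mathbf Q)} \right] + 1 = 345 + 1 = 346
$$
iterations are needed.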

4.1.2 Convergence rate for strongly convex function

$f$: strongly convex with parameter $m$ and with an $M$-Lipschitz continuous gradient on $S$. Its Hessian satisfies
$$
mI \preceq H_f (\mathbf x) \preceq MI \quad \forall\, \mathbf x\in S
$$
Lemma:

Let $\mathbf x^*$ be a minimizer of $\min\{f(\mathbf x) \mid \mathbf x \in S\}$. Then
$$
f(\mathbf{x})-\frac{1}{2 m}\|\nabla f(\mathbf{x})\|^{2} \leq f\left(\mathbf{x}^{*}\right) \leq f(\mathbf{x})-\frac{1}{2 M}\|\nabla f(\mathbf{x})\|^{2} \quad \forall\, \mathbf{x} \in S
$$
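A one-line sketch of the left inequality: by strong convexity,
$$
f(\mathbf x^{*}) \;\geq\; \min_{\mathbf y}\Big\{ f(\mathbf x)+\langle\nabla f(\mathbf x),\mathbf y-\mathbf x\rangle+\frac{m}{2}\|\mathbf y-\mathbf x\|^{2}\Big\} \;=\; f(\mathbf x)-\frac{1}{2m}\|\nabla f(\mathbf x)\|^{2}.
$$
The right inequality follows analogously from the upper quadratic bound given by the $M$-Lipschitz gradient, evaluated at $\mathbf y = \mathbf x - \frac{1}{M}\nabla f(\mathbf x)$.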
Theorem:

Let $\mathbf x^*$ be the unique minimizer and let $E_k = f(\mathbf x^k) - f(\mathbf x^*)$. Then
$$
\begin{aligned}
E_{k+1} & \leq E_{k}-\frac{1}{2 M}\left\|\nabla f\left(\mathbf{x}^{k}\right)\right\|^{2} \quad \text{(descent inequality)} \\
E_{k+1} & \leq E_{k}\left(1-\frac{m}{M}\right)
\end{aligned}
$$
Remark:

Combining the descent inequality with the lemma ($\|\nabla f(\mathbf x^k)\|^2 \ge 2m\,E_k$) gives the second inequality. Based on this theorem, requiring
$$
E_{k+1}/E_{1} \leq(1-m / M)^{k} \leq \varepsilon
$$
implies that we need the number of iterations $k$ to satisfy
$$
k \geq \frac{\log \varepsilon^{-1}}{\log \rho^{-1}} \approx \frac{M}{m} \log \varepsilon^{-1} \quad \text{(if } m / M \ll 1\text{)}
$$
where $\rho = 1-m/M$.
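A small Python sketch illustrating the linear rate of the theorem on a toy strongly convex quadratic, assuming the fixed step size $1/M$ (the function and the constants $m$, $M$ below are illustrative choices):

```python
import numpy as np

# Toy strongly convex quadratic f(x) = 0.5 * x^T A x with m*I <= Hessian <= M*I.
A = np.diag([1.0, 4.0, 25.0])
m, M = 1.0, 25.0
f = lambda x: 0.5 * x @ (A @ x)
grad = lambda x: A @ x

x = np.array([1.0, 1.0, 1.0])
E = f(x)                                  # E_k = f(x^k) - f(x*), with f(x*) = 0 here
for k in range(100):
    x = x - grad(x) / M                   # gradient step with fixed step size 1/M
    E_next = f(x)
    # linear-rate bound from the theorem: E_{k+1} <= (1 - m/M) * E_k
    assert E_next <= (1.0 - m / M) * E + 1e-12
    E = E_next
```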

4.2 Line search strategies

  1. minimization rule = exact line search
  2. Armijo rule (backtracking; see the sketch after this list)
  3. nonmonotone line search
  4. monotone line search, Barzilai–Borwein (BB) stepsize
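A minimal Python sketch of the Armijo backtracking rule in item 2, with illustrative parameter choices ($\sigma$, $\beta$ and the initial trial step are not specified in the notes):

```python
def armijo_step(f, grad_fx, x, d, alpha0=1.0, sigma=1e-4, beta=0.5, max_backtracks=50):
    """Backtrack alpha until f(x + alpha*d) <= f(x) + sigma*alpha*<grad f(x), d>.
    x, d, grad_fx are numpy arrays; d must be a descent direction (<grad_fx, d> < 0)."""
    fx = f(x)
    slope = float(grad_fx @ d)
    alpha = alpha0
    for _ in range(max_backtracks):
        if f(x + alpha * d) <= fx + sigma * alpha * slope:
            break
        alpha *= beta                     # shrink the trial step size
    return alpha
```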

4.3 Accelerated proximal gradient

Consider a smooth convex function $f$ with $L$-Lipschitz continuous gradient and the problem
$$
\min \{ F(x):= f(x)+g(x) \mid x\in \mathbb R^n \}
$$
where $g$ is a proper closed convex function (possibly non-differentiable).

Problem:
$$
\hat x = \operatorname{argmin} \{g(x)+q(x;\bar x) \mid x\in \mathbb R^n \}
$$
where $q(x;\bar x) = f(\bar x) + \langle \nabla f(\bar x), x-\bar x\rangle + \frac{1}{2} \langle x-\bar x, H(x-\bar x) \rangle$.
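For the common choice $H = L I_n$, completing the square (and dropping terms independent of $x$) shows that the subproblem is a proximal step:
$$
\hat{x}=\operatorname{argmin}\left\{ g(x)+\frac{L}{2}\left\| x-\left(\bar{x}-\tfrac{1}{L}\nabla f(\bar{x})\right)\right\|^{2} \mid x\in\mathbb{R}^{n}\right\} = \operatorname{prox}_{g/L}\!\left(\bar{x}-\tfrac{1}{L}\nabla f(\bar{x})\right),
$$
i.e. the proximal mapping of $g/L$ evaluated at a gradient step of $f$.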

APG method

[step 1] Given $x^{k}$, $x^{k-1}$ and $t_{k}$, form the extrapolated point $\bar x^{k} = x^{k} + \frac{t_{k-1}-1}{t_{k}}\,(x^{k}-x^{k-1})$ and compute $x^{k+1} = \operatorname{argmin}\{ g(x)+q(x;\bar x^{k}) \mid x\in\mathbb R^n\}$.

[step 2] Update $t_{k+1} = \frac{1+\sqrt{1+4t_{k}^{2}}}{2}$, set $k\leftarrow k+1$, and return to step 1.
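A minimal Python sketch of steps 1 and 2 with $H = L I_n$, so that step 1 reduces to the proximal step above; `grad_f`, `prox_g`, the constant `L`, and the starting point are assumed to be supplied by the caller (the names are illustrative):

```python
import numpy as np

def apg(grad_f, prox_g, L, x0, max_iter=500):
    """APG/FISTA-type iteration with H = L*I:
    x^{k+1} = prox_{g/L}(xbar^k - grad_f(xbar^k)/L)."""
    x_prev = np.asarray(x0, dtype=float).copy()
    x = x_prev.copy()
    t_prev, t = 1.0, 1.0
    for _ in range(max_iter):
        # [step 1] extrapolate, then take a proximal gradient step at xbar^k
        xbar = x + ((t_prev - 1.0) / t) * (x - x_prev)
        x_prev, x = x, prox_g(xbar - grad_f(xbar) / L, 1.0 / L)
        # [step 2] update the momentum parameter t_k
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return x
```

Here `prox_g(v, tau)` denotes the proximal mapping of $\tau g$ at `v`; for the sparse regression example below it is soft-thresholding, `np.sign(v) * np.maximum(np.abs(v) - rho * tau, 0.0)`.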

Lemma:

Assume $f(\hat x) \le q(\hat x; \bar x)$. Then the descent property holds:
$$
F(x)+\frac{1}{2}\|x-\bar{x}\|_{H}^{2} \geq F(\hat{x})+\frac{1}{2}\|x-\hat{x}\|_{H}^{2} \quad \forall x \in \mathbb{R}^{n}
$$
(The assumption holds, for example, when $\nabla f$ is $1$-Lipschitz continuous with respect to the norm $\|\cdot\|_{H}$.)

Theorem (error bound and iteration complexity):

For the proximal gradient (PG) method, $F(x^{k})-F(x^{*}) = O(1/k)$, so $O(1/\epsilon)$ iterations suffice to reach an $\epsilon$-accurate objective value; for the APG method, $F(x^{k})-F(x^{*}) = O(1/k^{2})$, so $O(1/\sqrt{\epsilon})$ iterations suffice. These are the two complexity results, for PG and APG respectively.

Examples:

  1. Sparse regression problem $\min\{\frac{1}{2} \|Ax-b\|^2 + \rho\|x\|_1 \mid x\in \mathbb R^n \}$. Let $f(x) = \frac{1}{2} \|Ax-b\|^2$ and $g(x) = \rho\|x\|_1$. $\nabla f(x) = A^T(Ax-b)$ is Lipschitz continuous with modulus $L=\lambda_{\max}(AA^T)$. Let $H=LI_n$; the APG method subproblem is given by
    $$
    x^{k+1}=\operatorname{argmin}\left\{g(x)+\left\langle\nabla f\left(\bar{x}^{k}\right), x-\bar{x}^{k}\right\rangle+\frac{L}{2}\left\|x-\bar{x}^{k}\right\|^{2} \mid x \in \mathbb{R}^{n}\right\}
    $$

  2. Given $G\in \mathbb S^n$, consider the projection onto the closed convex cone $\mathrm{DNN}_n^* = \mathbb S_+^n + \mathcal N^n$:
    $$
    \begin{array}{l}
    \min \left\{\frac{1}{2}\|S+Z-G\|^{2} \mid S \in \mathbb{S}_{+}^{n},\ Z \in \mathcal{N}^{n}\right\} \\
    =\min \left\{\frac{1}{2}\left\|\Pi_{\mathcal{N}^{n}}(S-G)\right\|^{2} \mid S \in \mathbb{S}_{+}^{n}\right\} \\
    =\min \left\{f(S)+\delta_{\mathbb{S}_{+}^{n}}(S) \mid S \in \mathbb{S}^{n}\right\}
    \end{array}
    $$
    where $f(S)=\frac{1}{2} \| \Pi_{\mathcal N^n} (S-G) \|^2$. The optimality condition for a minimizer $\bar S$ of the last minimization problem is given as follows:
    $$
    \begin{array}{l}
    0 \in \partial\left(f+\delta_{\mathbb{S}_{+}^{n}}\right)(\bar{S})=\nabla f(\bar{S})+\partial \delta_{\mathbb{S}_{+}^{n}}(\bar{S}) \\
    \Leftrightarrow -\nabla f(\bar{S}) \in \partial \delta_{\mathbb{S}_{+}^{n}}(\bar{S}) \\
    \Leftrightarrow \bar{S}-\nabla f(\bar{S}) \in \left(I+\partial \delta_{\mathbb{S}_{+}^{n}}\right)(\bar{S}) \\
    \Leftrightarrow \bar{S}=\left(I+\partial \delta_{\mathbb{S}_{+}^{n}}\right)^{-1}(\bar{S}-\nabla f(\bar{S}))=\Pi_{\mathbb{S}_{+}^{n}}(\bar{S}-\nabla f(\bar{S}))
    \end{array}
    $$
    We have $\nabla f(S) = \Pi_{\mathcal N^n}(S-G)$, and
    $$
    \begin{aligned}\left\|\nabla f(S)-\nabla f(S^{\prime})\right\| &=\left\|\Pi_{\mathcal{N}^{n}}(S-G)-\Pi_{\mathcal{N}^{n}}\left(S^{\prime}-G\right)\right\| \\ & \leq\left\|(S-G)-\left(S^{\prime}-G\right)\right\|=\left\|S-S^{\prime}\right\| \end{aligned}
    $$
    Thus $\nabla f(\cdot)$ is Lipschitz continuous with modulus $L = 1$. Picking $H = LI_n$, the APG subproblem is given by
    $$
    \begin{aligned} S^{k+1} &=\operatorname{argmin}\left\{\delta_{\mathbb{S}_{+}^{n}}(S)+\left\langle\nabla f\left(\bar{S}^{k}\right), S-\bar{S}^{k}\right\rangle+\frac{1}{2}\left\|S-\bar{S}^{k}\right\|^{2} \mid S \in \mathbb{S}^{n}\right\} \\ &=\operatorname{argmin}\left\{\frac{1}{2}\left\|S-\bar{S}^{k}+\nabla f\left(\bar{S}^{k}\right)\right\|^{2} \mid S \in \mathbb{S}_{+}^{n}\right\} \\ &=\Pi_{\mathbb{S}_{+}^{n}}\left(\bar{S}^{k}-\nabla f\left(\bar{S}^{k}\right)\right) \end{aligned}
    $$
    Observe that the subproblem is like carrying out a fixed-point iteration on the optimality condition.
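A minimal Python sketch of this fixed-point iteration $\bar S \leftarrow \Pi_{\mathbb S_+^n}(\bar S-\nabla f(\bar S))$ (the plain fixed-point / proximal gradient version, without the APG extrapolation; the function names and the test matrix are illustrative):

```python
import numpy as np

def proj_psd(X):
    """Projection onto S_+^n via eigenvalue thresholding."""
    w, V = np.linalg.eigh((X + X.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def proj_nonneg(X):
    """Projection onto N^n (entrywise nonnegative matrices)."""
    return np.maximum(X, 0.0)

def project_dnn_star(G, max_iter=1000, tol=1e-8):
    """Fixed-point iteration S <- Pi_{S_+^n}(S - grad f(S)), grad f(S) = Pi_{N^n}(S - G)."""
    S = proj_psd(G)                       # illustrative starting point
    for _ in range(max_iter):
        S_new = proj_psd(S - proj_nonneg(S - G))
        done = np.linalg.norm(S_new - S) <= tol
        S = S_new
        if done:
            break
    Z = proj_nonneg(G - S)                # optimal Z for the computed S
    return S, Z                           # S + Z approximates the projection of G onto DNN*

# Illustrative 2 x 2 example:
G = np.array([[1.0, -2.0], [-2.0, 0.5]])
S, Z = project_dnn_star(G)
```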