9. ADMM-DMPC

$$
\begin{array}{ll}
\min\limits_{u\in \mathbb R^{n_u}, v\in \mathbb R^{n_v}} & f(u)+g(v) \\
\text{s.t. } & Au+Bv=b
\end{array}
$$

where $f$ and $g$ are convex.

Consensus ADMM is well suited to distributed model-fitting problems.

The consensus method assigns a separate copy of the unknowns, $u_i$, to each processor and then applies ADMM to solve
$$
\begin{array}{ll}
\min\limits_{u_i\in \mathbb R^{d}, v\in \mathbb R^{d}} & \sum\limits_{i=1}^N f_i(u_i)+g(v) \\
\text{s.t. } & u_i = v, \quad i=1,\cdots,N
\end{array}
$$
This is an instance of the previous problem: let $u=(u_1; \cdots; u_N) \in \mathbb R^{dN}$, $v\in\mathbb R^{d}$, $A=I_{dN}\in \mathbb R^{dN\times dN}$, $B=-(I_d;\cdots; I_d)\in \mathbb R^{dN \times d}$, and $b=0$.
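
As a concrete illustration, here is a minimal numpy sketch of the consensus ADMM iteration (in scaled-dual form) for this problem; the data $M_i, b_i$, the choice $f_i(u_i)=\frac{1}{2}\|M_i u_i-b_i\|_2^2$, $g\equiv 0$, and the parameter values are illustrative assumptions.

```python
# A minimal numpy sketch of consensus ADMM (scaled-dual form) for distributed
# least squares: f_i(u_i) = 0.5*||M_i u_i - b_i||^2 with synthetic M_i, b_i, and
# g = 0. All data and parameter values are illustrative assumptions.
import numpy as np

np.random.seed(0)
N, d, rho = 5, 3, 1.0
M = [np.random.randn(20, d) for _ in range(N)]
b = [np.random.randn(20) for _ in range(N)]

u = np.zeros((N, d))      # local copies u_i
y = np.zeros((N, d))      # scaled dual variables y_i = lambda_i / rho
v = np.zeros(d)           # consensus variable

for k in range(200):
    # u_i-update (parallelizable): argmin_u f_i(u) + rho/2 * ||u - v + y_i||^2
    for i in range(N):
        u[i] = np.linalg.solve(M[i].T @ M[i] + rho * np.eye(d),
                               M[i].T @ b[i] + rho * (v - y[i]))
    # v-update (g = 0): average of the local copies plus scaled duals
    v = (u + y).mean(axis=0)
    # dual update
    y += u - v

# compare with the centralized solution of min_v 0.5 * sum_i ||M_i v - b_i||^2
v_star = np.linalg.lstsq(np.vstack(M), np.concatenate(b), rcond=None)[0]
print("distance to centralized solution:", np.linalg.norm(v - v_star))
```

Each $u_i$-update is independent and can run on a separate processor; the $v$-update is a simple averaging (gather) step, and the duals $y_i$ drive all local copies toward agreement.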

DMPC_ADMM

Problem formulation

Consider $N$ independent agents coupled through a graph $\mathcal G=(\mathcal V,\mathcal E)$ with vertex set $\mathcal V=\{1,\cdots,N\}$ and edge set $\mathcal E\subseteq \mathcal V\times \mathcal V$. The neighbor set of agent $i$ is $\mathcal N_i = \{j:(i,j)\in \mathcal E\}$.

Local subsystem dynamics:
$$
x_i(t+1) = A_ix_i(t) + B_iu_i(t), \quad i=1,\cdots,N
$$

Global system dynamics:
$$
x(t+1) = Ax(t) + Bu(t),
$$
where $x(t) = [x_1(t)^T, \cdots, x_N(t)^T]^T$, $u(t) = [u_1(t)^T, \cdots, u_N(t)^T]^T$, $A=\operatorname{diag}(A_1,\cdots,A_N)$, and $B=\operatorname{diag}(B_1,\cdots,B_N)$.

The objective is to minimize the prediction horizon cost function
$$
J=\sum_{t=0}^{T-1} \sum_{i=1}^N \ell_i ( x_{\mathcal N_i}(t), u_{\mathcal N_i}(t) )
$$
where the $\ell_i$ are convex, closed, and proper stage cost functions, and $x_{\mathcal N_i}$ and $u_{\mathcal N_i}$ denote the concatenations of the state and input vectors of agent $i$ and its neighbors.

Local constraints are $x_{\mathcal N_i} (t) \in \mathcal X_i$ and $u_{\mathcal N_i}(t)\in \mathcal U_i$, where $\mathcal X_i$ and $\mathcal U_i$ are convex sets that couple the states and inputs of neighboring agents.

Global constraints are $x(t)\in\mathcal X \subseteq \mathbb R^{\sum_i n_i}$ and $u(t)\in\mathcal U\subseteq\mathbb R^{\sum_i m_i}$, where $n_i$ and $m_i$ are the state and input dimensions of agent $i$.

Problem:
$$
\begin{array}{ll}
\min & \sum\limits_{t=0}^{T-1} \sum\limits_{i=1}^N \ell_i ( x_{\mathcal N_i}(t), u_{\mathcal N_i}(t) ) + \sum\limits_{i=1}^N \ell_{i,f} (x_{\mathcal N_i}(T), u_{\mathcal N_i}(T)) \\
\text{s.t.} & x_i(t+1) = A_ix_i(t) + B_iu_i(t), \\
& x_{\mathcal N_i}(t) \in \mathcal X_i, \quad u_{\mathcal N_i}(t) \in \mathcal U_i, \\
& \qquad i = 1,\cdots,N, \quad t=0,1,\cdots, T-1, \\
& x_{\mathcal N_i}(T) \in \mathcal X_{i,f}, \quad i = 1, \cdots, N
\end{array}
$$

For the centralized problem, there are various methods for choosing the terminal cost functions and terminal constraint sets so as to guarantee feasibility and stability of the closed-loop system.

For the distributed problem, these quantities should be chosen to reflect the information architecture, i.e., the terminal cost and constraint set for agent $i$ should depend only on $x_{\mathcal N_i}$.
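
For concreteness, here is a toy sketch of the centralized finite-horizon problem above, written with cvxpy for $N=2$ agents on a line graph. The random $A_i$, $B_i$, the quadratic neighborhood costs, and the box input sets standing in for $\mathcal U_i$ are illustrative assumptions, not the formulation of any particular reference.

```python
# A toy cvxpy sketch of the centralized finite-horizon DMPC problem for N = 2
# agents on a line graph. The random A_i, B_i, the quadratic neighborhood costs,
# and the box input sets standing in for U_i are illustrative assumptions only.
import cvxpy as cp
import numpy as np

np.random.seed(0)
N, T, n, m = 2, 10, 2, 1
A = [np.eye(n) + 0.1 * np.random.randn(n, n) for _ in range(N)]
B = [np.random.randn(n, m) for _ in range(N)]
x0 = [np.random.randn(n) for _ in range(N)]

x = [cp.Variable((n, T + 1)) for _ in range(N)]   # state trajectories
u = [cp.Variable((m, T)) for _ in range(N)]       # input trajectories

cost, cons = 0, []
for i in range(N):
    j = (i + 1) % N                               # the single neighbor of agent i
    cons += [x[i][:, 0] == x0[i]]
    for t in range(T):
        cons += [x[i][:, t + 1] == A[i] @ x[i][:, t] + B[i] @ u[i][:, t]]
        cons += [cp.norm(u[i][:, t], "inf") <= 1]          # stand-in for U_i
    # neighborhood stage + terminal cost: agent i penalizes its own trajectory
    # and its deviation from the neighbor's state trajectory
    cost += cp.sum_squares(x[i]) + cp.sum_squares(u[i]) \
            + 0.1 * cp.sum_squares(x[i] - x[j])

prob = cp.Problem(cp.Minimize(cost), cons)
prob.solve()
print("optimal cost:", prob.value)
```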

Dealing with constraints

All of the coupling constraints can be reduced to so-called consistency constraints by introducing local copies of the coupling variables at each subsystem and requiring these copies to agree across the coupled subsystems.

Let $\mathbf x_i$ be a local variable vector for agent $i$ that contains a copy of the relevant state and input trajectories over the finite horizon. For example, if agent 1 is connected to agent 2, then $\mathbf x_1$ includes local copies of $x_1, u_1, x_2, u_2$ over the horizon; likewise, $\mathbf x_2$ contains its own copies of $x_1, u_1, x_2, u_2$.

Let $\mathbf x = [\mathbf x_1^T, \cdots, \mathbf x_N^T]^T$ be a vector that collects all copies of all variables in the problem.

The problem has the form:
$$
\begin{array}{ll}
\min\limits_{\mathbf x_i \in \mathbf X_i} & \sum\limits_{i=1}^N f_i(\mathbf x_i) \\
\text{s.t.} & \mathbf x_i = \mathbf E_i \mathbf x, \quad i=1,\cdots,N.
\end{array}
$$

$\mathbf E_i$ is a matrix that picks out components of $\mathbf x$ that match components of the local variable $\mathbf x_i$. The constraint set $\mathbf X_i$ is convex and includes all constraints for agent $i$ and all of its neighbors.

Derivation of this reformulation:

$\mathbf x_i = [x_{\mathcal N_i}; u_{\mathcal N_i}]$, i.e., the stack of $(x_j, u_j)$ over the horizon for all $j \in \mathcal N_i$, and the local objective is
$$
f_i(\mathbf x_i) = \sum_{t=0}^{T-1} \ell_i(x_{\mathcal N_i}(t), u_{\mathcal N_i}(t)) + \ell_{i,f}(x_{\mathcal N_i}(T), u_{\mathcal N_i}(T))
$$

The objective is now separable across agents, and all coupling appears only through the linear consistency constraints $\mathbf x_i = \mathbf E_i \mathbf x$, which enforce agreement of the local variable copies.
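
A minimal numpy sketch of how a selector matrix $\mathbf E_i$ can be built (the index set used here is hypothetical): each row of $\mathbf E_i$ picks one entry of the stacked copy vector $\mathbf x$, so the consistency constraint reads $\mathbf x_i = \mathbf E_i \mathbf x$.

```python
# A minimal numpy sketch of a consistency (selector) matrix E_i: it picks, from
# the stacked vector x of all local copies, the entries that agent i's copy x_i
# must match, so the consistency constraint reads x_i = E_i @ x. Indices are
# hypothetical.
import numpy as np

def selector(rows, n_global):
    """Return E with E[k, rows[k]] = 1, so (E @ x)[k] = x[rows[k]]."""
    E = np.zeros((len(rows), n_global))
    E[np.arange(len(rows)), rows] = 1.0
    return E

# toy example: the global copy vector has 6 entries; agent i's local copy
# should coincide with entries 0, 1 and 4 of x.
E_i = selector([0, 1, 4], 6)
x = np.arange(6.0)
print(E_i @ x)        # -> [0. 1. 4.]
```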

Parallel Multi-block ADMM

$$
\begin{array}{ll}
\min & f_1(x_1) + \cdots + f_N(x_N) \\
\text{s.t.} & A_1 x_1 + \cdots + A_Nx_N = c \\
& x_1\in \mathcal X_1, \cdots, x_N \in \mathcal X_N
\end{array}
$$

where $x_i\in \mathbb R^{n_i}$, $A_i\in \mathbb R^{m\times n_i}$, $c\in \mathbb R^m$, and $f_i:\mathbb R^{n_i} \rightarrow (-\infty, +\infty]$ are closed proper convex functions. The constraint $x_i\in \mathcal X_i$ can be incorporated into the objective $f_i$ via the indicator function $\delta_{\mathcal X_i}(x_i)$.

Method 1: Dual decomposition + dual ascent method

$$
\begin{array}{l}
\mathcal{L}\left(x_{1}, \ldots, x_{N}, \lambda\right) =\sum\limits_{i=1}^{N} f_{i}\left(x_{i}\right)-\lambda^{\top}\left(\sum\limits_{i=1}^{N} A_{i} x_{i}-c\right) \\
\begin{cases}
\left(x_1^{k+1}, x_2^{k+1}, \cdots, x_N^{k+1}\right) = \underset{x_1,\cdots,x_N}{\arg\min}\ \mathcal L(x_1, \cdots, x_N, \lambda^k) \\
\lambda^{k+1} =\lambda^{k}-\alpha_{k}\left(\sum\limits_{i=1}^{N} A_{i} x_{i}^{k+1}-c\right)
\end{cases}
\end{array}
$$

where $\lambda \in \mathbb{R}^{m}$ is the Lagrange multiplier (dual variable). Since the Lagrangian is separable in the $x_i$, the $x$-update reduces to solving $N$ independent $x_i$-subproblems, which can be done in parallel.

Convergence rate: $O(\frac{1}{\sqrt{k}})$.
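
A minimal numpy sketch of this dual decomposition / dual ascent iteration, assuming strongly convex quadratic blocks $f_i(x_i)=\frac{1}{2}\|x_i-q_i\|_2^2$ with synthetic data so that each $x_i$-subproblem has a closed form.

```python
# A minimal numpy sketch of dual decomposition + dual ascent for
# min sum_i 0.5*||x_i - q_i||^2  s.t.  sum_i A_i x_i = c, with synthetic A_i, q_i, c.
import numpy as np

np.random.seed(0)
N, n, m = 4, 3, 2
A = [np.random.randn(m, n) for _ in range(N)]
q = [np.random.randn(n) for _ in range(N)]
c = np.random.randn(m)

# step size: the dual gradient is Lipschitz with constant L = ||sum_i A_i A_i^T||_2
L = np.linalg.norm(sum(A[i] @ A[i].T for i in range(N)), 2)
alpha = 1.0 / L

lam = np.zeros(m)
for k in range(2000):
    # x_i-subproblems, solved independently (closed form here):
    # argmin_x 0.5*||x - q_i||^2 - lam^T A_i x  =>  x_i = q_i + A_i^T lam
    x = [q[i] + A[i].T @ lam for i in range(N)]
    residual = sum(A[i] @ x[i] for i in range(N)) - c
    lam = lam - alpha * residual          # dual ascent step

print("primal residual:", np.linalg.norm(residual))   # should be close to zero
```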

Method 2: ADMM

Augmented Lagrangian function
$$
\mathcal{L}_{\rho}\left(x_{1}, \ldots, x_{N}, \lambda\right)=\sum_{i=1}^{N} f_{i}\left(x_{i}\right)-\lambda^{\top}\left(\sum_{i=1}^{N} A_{i} x_{i}-c\right)+\frac{\rho}{2}\left\|\sum_{i=1}^{N} A_{i} x_{i}-c\right\|_{2}^{2}
$$
The last term introduces a quadratic penalty on the constraint violation.

To apply ADMM, one can first convert the multi-block problem into a two-block problem via variable splitting:
$$
\begin{array}{ll}
\min\limits_{\{x_i\},\{z_i\}} & \sum\limits_{i=1}^{N} f_{i}\left(x_{i}\right)+I_{\mathcal{Z}}\left(z_{1}, \ldots, z_{N}\right) \\
\text{s.t.} & A_{i} x_{i}-z_{i}=\dfrac{c}{N}, \quad i=1,2, \ldots, N
\end{array}
$$
Here, $I_{\mathcal Z}$ is the indicator function of the convex set $\mathcal{Z}=\left\{\left(z_{1}, \ldots, z_{N}\right): \sum_{i=1}^{N} z_{i}=0\right\}$.

The two blocks are $x:= (x_1, \cdots, x_N)$ and $z:= (z_1, \cdots, z_N)$, to which the classic two-block ADMM is applied.

Variable Splitting ADMM

$$
\mathcal{L}_{\rho}(x, z, \lambda)=\sum_{i=1}^{N} f_{i}\left(x_{i}\right)+I_{\mathcal{Z}}(z)-\sum_{i=1}^{N} \lambda_{i}^{\top}\left(A_{i} x_{i}-z_{i}-\frac{c}{N}\right)+\frac{\rho}{2} \sum_{i=1}^{N}\left\|A_{i} x_{i}-z_{i}-\frac{c}{N}\right\|_{2}^{2}
$$

$$
z^{k+1}=\underset{\left\{z: \sum_{i=1}^{N} z_{i}=0\right\}}{\arg \min }\ \sum_{i=1}^{N} \frac{\rho}{2}\left\|A_{i} x_{i}^k-z_{i}-\frac{c}{N}-\frac{\lambda_{i}^{k}}{\rho}\right\|_{2}^{2}
$$

$$
\begin{cases}
z_i^{k+1}= \left(A_i x_i^k -\frac{c}{N}-\frac{\lambda_{i}^{k}}{\rho} \right) -\frac{1}{N} \sum\limits_{j=1}^N \left( A_j x_j^k -\frac{c}{N}-\frac{\lambda_{j}^{k}}{\rho} \right) \\
x_i^{k+1}=\underset{x_i}{\arg \min }\ f_{i}\left(x_{i}\right) + \frac{\rho}{2}\left\|A_{i} x_{i}-z_{i}^{k+1}-\frac{c}{N}-\frac{\lambda_{i}^{k}}{\rho}\right\|_{2}^{2} \\
\lambda_i^{k+1} =\lambda_i^{k}-\rho\left(A_{i} x_{i}^{k+1}-z_{i}^{k+1}-\frac{c}{N}\right)
\end{cases}
$$
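
The $z$-update is just the Euclidean projection of the points $w_i := A_i x_i^k - \frac{c}{N} - \frac{\lambda_i^k}{\rho}$ onto $\{z:\sum_i z_i=0\}$, i.e., subtracting their block average. A small numpy check with random stand-in data:

```python
# A small numpy check (random stand-in data) that the z-update above is the
# projection of w_i := A_i x_i^k - c/N - lambda_i^k/rho onto {z : sum_i z_i = 0},
# i.e. subtracting the block average.
import numpy as np

np.random.seed(0)
N, m = 4, 3
w = np.random.randn(N, m)            # stands for the blocks w_i at iteration k

z = w - w.mean(axis=0)               # closed-form z_i^{k+1}
print(np.linalg.norm(z.sum(axis=0))) # feasibility: sum_i z_i = 0 (up to round-off)

# sanity check: no feasible perturbation of z can reduce the objective
# sum_i (rho/2)*||w_i - z_i||^2 (rho > 0 is a constant scaling and drops out)
d = np.random.randn(N, m); d -= d.mean(axis=0)      # feasible direction
obj = lambda zz: 0.5 * np.sum((w - zz) ** 2)
print(obj(z) <= obj(z + 1e-3 * d) + 1e-12)          # -> True
```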

This method introduces splitting variables $z_i$, which substantially increases the number of variables and constraints in the problem, especially when $N$ is large.

In summary: first convert the problem into a two-block problem, then apply the classic ADMM.

sGS-ADMM

Alternatively, simply replace the two-block alternating minimization scheme by a sweep of Gauss-Seidel updates, i.e., update $x_i$ for $i = 1, 2, \cdots, N$ sequentially as follows:
$$
\begin{aligned}
x_{i}^{k+1} &=\underset{x_{i}}{\arg \min }\ \mathcal{L}_{\rho}\left(x_{1}^{k+1}, \ldots, x_{i-1}^{k+1}, x_{i}, x_{i+1}^{k}, \ldots, x_{N}^{k}, \lambda^{k}\right) \\
&=\underset{x_{i}}{\arg \min }\ f_{i}\left(x_{i}\right)+\frac{\rho}{2}\left\|\sum_{j<i} A_{j} x_{j}^{k+1}+A_{i} x_{i}+\sum_{j>i} A_{j} x_{j}^{k}-c-\frac{\lambda^{k}}{\rho}\right\|_{2}^{2}
\end{aligned}
$$

$$
\begin{cases}
x_{i}^{k+1}=\underset{x_{i}}{\arg \min }\ f_{i}\left(x_{i}\right)+\frac{\rho}{2}\left\|\sum_{j<i} A_{j} x_{j}^{k+1}+A_{i} x_{i}+\sum_{j>i} A_{j} x_{j}^{k}-c-\frac{\lambda^{k}}{\rho}\right\|_{2}^{2} \\
\lambda^{k+1}=\lambda^{k}-\rho\left(\sum_{i=1}^{N} A_{i} x_{i}^{k+1}-c\right)
\end{cases}
$$
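
A minimal numpy sketch of this Gauss-Seidel sweep, assuming quadratic blocks $f_i(x_i)=\frac{1}{2}\|x_i-q_i\|_2^2$ with synthetic data so each $x_i$-update has a closed form; as noted below, convergence is not guaranteed in general, and the printed residual only reports the behavior on this particular instance.

```python
# A minimal numpy sketch of the Gauss-Seidel multi-block sweep, with synthetic
# quadratic blocks f_i(x_i) = 0.5*||x_i - q_i||^2 so each x_i-update has a closed
# form. Convergence is not guaranteed in general for multi-block ADMM.
import numpy as np

np.random.seed(0)
N, n, m, rho = 4, 3, 2, 0.1
A = [np.random.randn(m, n) for _ in range(N)]
q = [np.random.randn(n) for _ in range(N)]
c = np.random.randn(m)

x = [np.zeros(n) for _ in range(N)]
lam = np.zeros(m)

for k in range(3000):
    for i in range(N):                     # sequential Gauss-Seidel sweep
        # r = sum_{j<i} A_j x_j^{k+1} + sum_{j>i} A_j x_j^k (x[j] holds the latest value)
        r = sum(A[j] @ x[j] for j in range(N) if j != i)
        # argmin_x 0.5*||x - q_i||^2 + rho/2 * ||A_i x + r - c - lam/rho||^2
        x[i] = np.linalg.solve(np.eye(n) + rho * A[i].T @ A[i],
                               q[i] - rho * A[i].T @ (r - c - lam / rho))
    lam = lam - rho * (sum(A[i] @ x[i] for i in range(N)) - c)

print("constraint residual:",
      np.linalg.norm(sum(A[i] @ x[i] for i in range(N)) - c))
```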

The algorithm may fail to converge for $N \geq 3$. Despite the lack of a convergence guarantee, empirical studies show that it is still very effective at solving many practical problems.

A disadvantage of Gauss-Seidel ADMM is that the blocks are updated one after another, which is not amenable to parallelization.

Jacobian ADMM

A Jacobi-type scheme updates all $N$ blocks in parallel:
$$
\begin{aligned}
x_{i}^{k+1} &=\underset{x_{i}}{\arg \min }\ \mathcal{L}_{\rho}\left(x_{1}^{k}, \ldots, x_{i-1}^{k}, x_{i}, x_{i+1}^{k}, \ldots, x_{N}^{k}, \lambda^{k}\right) \\
&=\underset{x_{i}}{\arg \min }\ f_{i}\left(x_{i}\right)+\frac{\rho}{2}\left\|\sum_{j\neq i} A_{j} x_{j}^{k}+A_{i} x_{i}-c-\frac{\lambda^{k}}{\rho}\right\|_{2}^{2}
\end{aligned}
$$

$$
\begin{cases}
x_{i}^{k+1}=\underset{x_{i}}{\arg \min }\ f_{i}\left(x_{i}\right)+\frac{\rho}{2}\left\|\sum_{j\neq i} A_{j} x_{j}^{k}+A_{i} x_{i}-c-\frac{\lambda^{k}}{\rho}\right\|_{2}^{2} \\
\lambda^{k+1}=\lambda^{k}-\rho\left(\sum_{i=1}^{N} A_{i} x_{i}^{k+1}-c\right)
\end{cases}
$$

This scheme is more likely to diverge than the Gauss-Seidel scheme for the same parameter $\rho$; in fact, it may diverge even in the two-block case; see [16] for such an example. To guarantee convergence, either additional assumptions or modifications to the algorithm are required.

Proximal Jacobian ADMM

Add a proximal term $\frac{1}{2}\|x_i - x_i^k \|_{P_i}^2$ to each $x_i$-subproblem, where $P_i \succeq 0$ is a symmetric positive semidefinite matrix and $\|x_i \|_{P_i}^2 := x_i^\top P_i x_i$.

Also add a damping parameter $\gamma >0$ in the update of $\lambda$:
$$
\begin{cases}
x_{i}^{k+1}=\underset{x_{i}}{\arg \min }\ f_{i}\left(x_{i}\right)+\frac{\rho}{2}\left\|\sum_{j\neq i} A_{j} x_{j}^{k}+A_{i} x_{i}-c-\frac{\lambda^{k}}{\rho}\right\|_{2}^{2}+\frac{1}{2}\left\| x_i-x_i^k\right\|_{P_i}^2 \\
\lambda^{k+1}=\lambda^{k}-\gamma\rho\left(\sum_{i=1}^{N} A_{i} x_{i}^{k+1}-c\right)
\end{cases}
$$
where
$$
\begin{cases}
P_{i} \succ \rho\left(\frac{1}{\epsilon_{i}}-1\right) A_{i}^{\top} A_{i}, \quad i=1,2, \ldots, N \\
\sum\limits_{i=1}^{N} \epsilon_{i}<2-\gamma
\end{cases}
$$
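
A minimal numpy sketch of Proximal Jacobian ADMM with $P_i=\tau_i I$ chosen to satisfy the condition above ($\gamma=1$, $\epsilon_i=\frac{1}{2N}$), again with synthetic quadratic blocks $f_i(x_i)=\frac{1}{2}\|x_i-q_i\|_2^2$; dropping the proximal terms and keeping $\gamma=1$ would recover the plain Jacobi scheme of the previous subsection.

```python
# A minimal numpy sketch of Proximal Jacobian ADMM with P_i = tau_i * I, gamma = 1,
# eps_i = 1/(2N), so that the sufficient condition above holds; the quadratic blocks
# f_i(x_i) = 0.5*||x_i - q_i||^2 and all data are illustrative assumptions.
import numpy as np

np.random.seed(0)
N, n, m, rho, gamma = 4, 3, 2, 1.0, 1.0
A = [np.random.randn(m, n) for _ in range(N)]
q = [np.random.randn(n) for _ in range(N)]
c = np.random.randn(m)

eps = 1.0 / (2 * N)                        # sum_i eps_i = 1/2 < 2 - gamma
tau = [1.01 * rho * (1.0 / eps - 1.0) * np.linalg.norm(A[i], 2) ** 2
       for i in range(N)]                  # ensures tau_i*I > rho*(1/eps_i - 1)*A_i^T A_i

x = np.zeros((N, n))
lam = np.zeros(m)

for k in range(5000):
    Ax = sum(A[i] @ x[i] for i in range(N))
    x_new = np.empty_like(x)
    for i in range(N):                     # all blocks use x^k, so this loop is parallelizable
        r = Ax - A[i] @ x[i]               # sum_{j != i} A_j x_j^k
        # argmin_x 0.5*||x - q_i||^2 + rho/2*||A_i x + r - c - lam/rho||^2 + tau_i/2*||x - x_i^k||^2
        lhs = (1.0 + tau[i]) * np.eye(n) + rho * A[i].T @ A[i]
        rhs = q[i] + tau[i] * x[i] - rho * A[i].T @ (r - c - lam / rho)
        x_new[i] = np.linalg.solve(lhs, rhs)
    x = x_new
    lam = lam - gamma * rho * (sum(A[i] @ x[i] for i in range(N)) - c)

print("constraint residual:",
      np.linalg.norm(sum(A[i] @ x[i] for i in range(N)) - c))
```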
Advantages:

  • Global convergence as well as an $o(\frac{1}{k})$ convergence rate are guaranteed under the above conditions on $P_i$ and $\gamma$.

  • When the $x_i$-subproblem is not strictly convex, adding the proximal term makes the subproblem strictly or strongly convex, which makes the algorithm more stable.

  • There are multiple choices for the matrices $P_i$ with which the subproblems can be made easier to solve.

Each $x_i$-subproblem contains a quadratic term $\frac{\rho}{2} x_i^\top A_i^\top A_i x_i$. When $A_i^\top A_i$ is ill-conditioned or computationally expensive to invert, one can let $P_i = D_i - \rho A_i^\top A_i$, which cancels the quadratic term $\frac{\rho}{2} x_i^\top A_i^\top A_i x_i$ and adds $\frac{1}{2} x_i^\top D_i x_i$. The matrix $D_i$ can be chosen as a well-conditioned, simple matrix (e.g., a diagonal matrix), leading to an easier subproblem.

  • $P_i=\tau_iI (\tau_i>0)$: Proximal method

  • $P_i=\tau_i I - \rho A_i^\top A_i$ ($\tau_i>0$): prox-linear method, which linearizes the quadratic penalty term of the augmented Lagrangian at the current $x_i^k$ and adds a proximal term:
    $$
    x_{i}^{k+1}=\underset{x_{i}}{\arg \min }\ f_{i}\left(x_{i}\right)+\left\langle\rho A_{i}^{\top}\left(A x^{k}-c-\lambda^{k} / \rho\right), x_{i}\right\rangle+\frac{\tau_{i}}{2}\left\|x_{i}-x_{i}^{k}\right\|^{2}
    $$
    (here $A x^{k} := \sum_{j} A_j x_j^{k}$). This uses $\tau_i I$ to approximate the Hessian $\rho A_i^\top A_i$ of the quadratic penalty term; a sketch of this update follows the list.
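
A short numpy sketch of the prox-linear update assuming $f_i(x_i)=\mu\|x_i\|_1$ (an illustrative choice): the update is one gradient step on the quadratic penalty at $x^k$ followed by the proximal operator of $f_i$, i.e., soft-thresholding.

```python
# A short numpy sketch of the prox-linear x_i-update, assuming f_i(x_i) = mu*||x_i||_1:
# one gradient step on the quadratic penalty at x^k followed by the proximal operator
# of f_i (soft-thresholding). Function and variable names are illustrative.
import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def prox_linear_update(x_i, A_i, Ax_k, c, lam, rho, tau_i, mu):
    """x_i^{k+1} = prox_{(mu/tau_i)||.||_1}(x_i^k - g/tau_i), g = rho*A_i^T(Ax^k - c - lam/rho)."""
    g = rho * A_i.T @ (Ax_k - c - lam / rho)
    return soft_threshold(x_i - g / tau_i, mu / tau_i)

# toy call with random data; Ax_k stands for sum_j A_j x_j^k (a single block here)
np.random.seed(0)
m, n = 4, 6
A_i, x_i = np.random.randn(m, n), np.random.randn(n)
print(prox_linear_update(x_i, A_i, A_i @ x_i, np.zeros(m), np.zeros(m),
                         rho=1.0, tau_i=5.0, mu=0.1))
```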

Adaptive Parameter Tuning

The matrices $\{P_i\}$ can also be adjusted adaptively during the iterations.

Such a strategy starts with relatively small proximal terms and gradually increases them; after a finite number of iterations, $\{P_i\}$ are kept constant, which guarantees convergence.

Empirical evidence shows that the parameters $\{P_i\}$ typically adjust themselves only during the first few iterations and then remain constant afterwards. Alternatively, one may also decrease the parameters after every few iterations, or after they have not been updated for a certain number of iterations, but the total number of decreases should be bounded to guarantee convergence. With this adaptive strategy, the resulting parameters $\{P_i\}$ are usually much smaller than those required by condition (2.8), thereby leading to substantially faster convergence in practice.
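
One plausible realization of such an adaptive scheme, as an illustrative sketch only (the specific rule and condition (2.8) are in the reference and are not reproduced here): multiply $\tau_i$ by a fixed factor whenever a residual-based progress check fails, and cap the total number of increases so that $\{P_i\}$ eventually stay constant.

```python
# An illustrative sketch only (not the specific rule or condition (2.8) from the
# reference) of adaptively tuning P_i = tau_i * I: multiply tau_i by eta whenever a
# residual-based progress check fails, and cap the total number of increases so that
# the parameters eventually stay constant (as the convergence argument requires).
def adapt_tau(tau, residual, prev_residual, n_increases,
              eta=2.0, decrease_tol=0.999, max_increases=20):
    """Return possibly enlarged proximal weights tau and the updated increase counter."""
    if residual > decrease_tol * prev_residual and n_increases < max_increases:
        tau = [eta * t for t in tau]      # larger proximal terms -> more conservative steps
        n_increases += 1
    return tau, n_increases
```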