Generative models with tractable explicit density are those where $p(x)$ can be computed directly. They include most autoregressive models (NADE, MADE, PixelRNN/CNN) as well as flow-based models (via change of variables: $p(x)$ is obtained from $p(z)$ through a transformation $x=T(z)$).
Basic trick: when computing the Jacobian determinant, it is best to structure the Jacobian as a triangular matrix, because the determinant of a triangular matrix is just the product of the diagonal elements. This reduces the original $O(D^3)$ complexity to $O(D)$ and removes much of the computational burden.
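A quick numeric sanity check of this trick (a toy triangular Jacobian, assumed purely for illustration):

```python
import numpy as np

# Minimal sketch: for a transformation whose Jacobian J is lower-triangular,
# log|det J| is just the sum of log|diagonal| entries, O(D) instead of O(D^3).
D = 5
rng = np.random.default_rng(0)
J = np.tril(rng.normal(size=(D, D)))            # a lower-triangular Jacobian
np.fill_diagonal(J, np.abs(np.diag(J)) + 0.1)   # keep the diagonal away from zero

logdet_full = np.linalg.slogdet(J)[1]                # general O(D^3) routine
logdet_tri = np.sum(np.log(np.abs(np.diag(J))))      # O(D) triangular shortcut
assert np.allclose(logdet_full, logdet_tri)
```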
Finite compositions model: construct a flow with transformation $T$ by composing a finite number of simple transformations $T_k$ (a flow model built from finitely many simple transformations).
Autoregressive flows
Linear flows
Residual flows
Coupling layers
NICE (Dinh et al. 2015), Real NVP (Dinh et al. 2017), Glow (Kingma and Dhariwal, 2018), WaveGlow (Prenger et al. 2019), FloWaveNet (Kim et al. 2019), Flow++ (Ho et al. 2019).
Advantages:
Exploit computational symmetry, so density evaluation and inverse computation are efficient.
Density evaluation and sampling can both be performed in a single neural network pass.
Individual chance constraints are easy to solve; however, they only guarantee that each individual constraint is satisfied to a certain confidence level.
Joint chance constraint: $$ h_i(x,\xi) \geq 0 \rightarrow P(h_i(x,\xi) \geq 0 \ (i=1,\cdots, m))\geq p $$ A joint chance constraint ensures that the constraints as a whole are satisfied to a certain confidence level; however, it is extremely difficult to solve, even numerically.
Solving Method:
Simple case (decision and random variables can be decoupled):
chance constraints can be relaxed into deterministic constraints using probability density functions (pdf). Then, use LP or NLP to solve.
Linear ($h$ is linear in $\xi$), i.e., $P(h(x)\ge 0)\ge p$
Two types:
separable model: $P(y(x)\ge A\xi)\ge p$
bilinear model: $P(\Xi \cdot x \geq b) \geq p$
calculate the probability by using the probability density function (pdf) and substitute the left-hand side of the constraint with a deterministic expression (a small numeric sketch follows this list)
PDF :
statistical regression (abundant data),
interpolation and extrapolation (few data)
Marginal distribution function (when the uncertain variables have correlations)
Feasible region depends on confidence level
high confidence level $\rightarrow$ small feasible region
low confidence level $\rightarrow$ larger feasible region
different constraints can be assigned different confidence levels according to their importance
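As referenced above, a minimal numeric sketch of relaxing the simple separable case into a deterministic constraint, assuming a scalar Gaussian $\xi$ (the values are illustrative):

```python
from scipy.stats import norm

# Sketch: P(xi <= h(x)) >= p with scalar Gaussian xi ~ N(mu, sigma^2) becomes the
# deterministic constraint h(x) >= F_xi^{-1}(p) = mu + sigma * Phi^{-1}(p).
mu, sigma, p = 2.0, 0.5, 0.95
h_min = mu + sigma * norm.ppf(p)   # deterministic right-hand side
print(f"chance constraint holds iff h(x) >= {h_min:.3f}")

# Sanity check: the probability at the boundary equals the confidence level.
assert abs(norm.cdf(h_min, loc=mu, scale=sigma) - p) < 1e-9
```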
Nonlinear
relax the problem by transforming it into a deterministic NLP problem.
Solving nonlinear chance-constrained problems is difficult, because nonlinear propagation makes it hard to obtain the distribution of the output variables when the distribution of the uncertainty is unknown.
strategy: back-mapping
robust optimization
sample average approximation
Complex case (decision and random variables interact and cannot be decoupled)
impossible to solve.
Challenges:
Structural Challenges
Structural Property:
Probability function: $\varphi(x) :=P(h(x, \xi) \geq 0)$
set of feasible decisions: $M := \left\{x \in \mathbb{R}^{n} \mid \varphi(x) \geq p\right\}$
==Proposition== (Upper semicontinuity, closedness) If the $h_i$ are upper semicontinuous (usc), then so is $\varphi$; hence $M$ is closed.
==Proposition== If the $h_i$ are continuous and $P(h_i(x,\xi)=0)=0$ for all $x \in \mathbb{R}^n$ and all $i \in \{1,\cdots,s\}$, then $\varphi$ is continuous too.
==Proposition== If $\xi$ has pdf $f_\xi$, i.e., $F_\xi(z)=\int_{-\infty}^{z} f_\xi(x)\, dx$, then $F_\xi$ is continuous.
==Theorem== (Wang 1985, Römisch/Schultz 1993) If $\xi$ has a density $f_\xi$, then $F_\xi$ is Lipschitz continuous if and only if all marginal densities $f_{\xi_i}$ are essentially bounded.
==Theorem (R.H./Römisch 2010)== If $\xi$ has a density $f_\xi$ such that $f_\xi^{-1/s}$ is convex, then $F_\xi$ is Lipschitz continuous.
Assumption satisfied by most prominent multivariate distributions: Gaussian, Dirichlet, t, Wishart, Gamma, lognormal, uniform
This allows problems with up to a few hundred random variables to be handled.
To cope with more complicated models, such as $\mathbb{P}(h(x) \geq A \xi) \geq p$ or $\mathbb{P}(\Xi \cdot x \geq b) \geq p$:
Compute Derivatives for Gaussian Probabilities of Rectangles
Let $\xi \sim \mathcal{N}(\mu, \Sigma)$ with $\Sigma$ positive definite. Consider a two-sided probabilistic constraint: $\mathbb{P}(\xi \in[a(x), b(x)]) \geq p$. Then, $\alpha_{\xi}(a(x), b(x)) \geq p, \quad$ where $\quad \alpha_{\xi}(a, b) :=\mathbb{P}(\xi \in[a, b])$
Partial derivatives:
Method 1:
Reduction to distribution functions, then use the known gradient formula. $$ \alpha_{\xi}(a, b)=\sum_{i_{1}, \ldots, i_{s} \in\{0,1\}}(-1)^{s+\sum_{j=1}^{s} i_{j}} F_{\xi}\left(y_{i_{1}}, \ldots, y_{i_{s}}\right), \quad y_{i_{j}} := \begin{cases} a_{j} & \text{if } i_{j}=0 \\ b_{j} & \text{if } i_{j}=1 \end{cases} $$ For dimension $s$, there are $2^s$ terms in the sum. Not practicable!
Method 2: $$ \alpha_{\xi}(a, b)=\mathbb{P}\left(\begin{pmatrix}\xi \\ -\xi\end{pmatrix} \leq \begin{pmatrix}b \\ -a\end{pmatrix}\right), \quad \begin{pmatrix}\xi \\ -\xi\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\mu \\ -\mu\end{pmatrix}, \begin{pmatrix}\Sigma & -\Sigma \\ -\Sigma & \Sigma\end{pmatrix}\right) $$ $\Longrightarrow$ Singular normal distribution, gradient formula not available.
==Proposition (Ackooij/R.H./Möller/Zorgati 2010)== Let $\xi \sim \mathcal{N}(\mu, \Sigma)$ with $\Sigma$ positive definite and $f_\xi$ the corresponding density. Then, $\frac{\partial \alpha_{\xi}}{\partial b_{i}}(a, b)=f_{\xi_{i}}\left(b_{i}\right) \alpha_{\tilde{\xi}\left(b_{i}\right)}(\tilde{a}, \tilde{b}); \quad \frac{\partial \alpha_{\xi}}{\partial a_{i}}(a, b)=-f_{\xi_{i}}\left(a_{i}\right) \alpha_{\tilde{\xi}\left(a_{i}\right)}(\tilde{a}, \tilde{b})$, with the tilde-quantities defined as in the gradient formula for Gaussian distribution functions.
$\Longrightarrow$ Use Genz’ code to calculate $\alpha_\xi$ and $\nabla_{a, b} \alpha_{\xi}$ at a time.
Derivatives for separated model with Gaussian data:
Let $\xi \sim \mathcal{N}(\mu, \Sigma)$ with $\Sigma$ positive definite and consider the probability function: $$ \beta_{\xi, A}(x) :=\mathbb{P}(A \xi \leq x) $$ If the rows of $A$ are linearly independent, then put $\eta :=A \xi \sim \mathcal{N}\left(A \mu, A \Sigma A^{T}\right)$, then $$ \Longrightarrow \quad \beta_{\xi, A}(x)=F_{\eta}(x)\quad \text{regular Gaussian distribution function} $$ otherwise, $F_\eta$ is a singular Gaussian distribution function (gradient formula not available)
==Theorem (R.H./Möller 2010)== $\frac{\partial \beta_{\xi, A}}{\partial x_{i}}(x)=f_{A_{i} \xi}\left(x_{i}\right) \beta_{\tilde{\xi}, \tilde{A}}(\tilde{x})$, where $\tilde{\xi} \sim \mathcal{N}\left(0, I_{s-1}\right)$ and $\tilde{A}$, $\tilde{x}$ can be calculated explicitly from $A$ and $x$.
use e.g., Deak’s code for calculating normal probabilities of convex sets.
Derivatives for bilinear model with Gaussian data
Probability function: $\gamma(x) :=\mathbb{P}(\Xi x \leq a)$ with a normally distributed $(m \times s)$ coefficient matrix $\Xi$. Let $\xi_i$ be the $i$-th row of $\Xi$.
Global optimization: if the optimization problem is convex, a local optimum is the global optimum; otherwise, methods such as the McCormick envelope approximation should be used.
If $F_\xi(h(x))$ is concave in $x$, the feasible set $\{x : F_\xi(h(x)) \ge p\}$ is convex and the optimization problem is convex.
Sufficient:
components $h_j$ concave: e.g., $h$ is a linear mapping
==Corollary (Convexity in the separable model)== Consider the feasible set $M := \{x \in \mathbb{R}^n \mid P(A\xi \leq h(x)) \geq p\}$. Let $\xi$ have a density $f_\xi$ such that $\log f_\xi$ is concave and let the $h_i$ be concave (e.g., $h$ linear). Then, $M$ is convex for any $p\in[0, 1]$.
Bilinear model:
feasible set $M :=\left\{x \in \mathbb{R}^{n} \mid \mathbb{P}(\Xi x \leq a) \geq p\right\}$
==Theorem (Van de Panne/Popp, Kataoka 1963, Kan 2002, Lagoa/Sznaier 2005)== Let $\Xi$ have one row only which has an elliptically symmetric or log-concave symmetric distribution (e.g., Gaussian). Then, $M$ is convex for $p\ge 0.5$.
==Theorem (R.H./Strugarek 2008)== Let the rows $\xi_i$ of $\Xi$ be Gaussian according to $\xi_i \sim \mathcal{N}(\mu_i,\Sigma_i)$. If the $\xi_i$ are pairwise independent, then $M$ is convex for $p > \Phi(\max\{\sqrt{3},\tau\})$, where
$\Phi =$ 1-dimensional standard normal distribution function
$\tau=\max_{i} \lambda_{\max}^{(i)}\left[\lambda_{\min}^{(i)}\right]^{-3/2}\left\|\mu_{i}\right\|$, with $\lambda_{\max}^{(i)}, \lambda_{\min}^{(i)}$ the largest and smallest eigenvalues of $\Sigma_i$. Moreover, $M$ is compact for $p>\min_{i} \Phi\left(\left\|\mu_{i}\right\|_{\Sigma_{i}^{-1}}\right)$.
A challenging issue arises when the distribution is uniform or normal with independent components, since these may lack strong log-concavity in certain circumstances. Furthermore, the Student's t-distribution, the Cauchy distribution, the Pareto distribution, and the log-normal distribution are not log-concave.
Stability Challenges
Distribution of $\xi$ rarely known $\Longrightarrow$ Approximation by some $\eta$ $\Longrightarrow$ Stability?
==Theorem (R.H./W.R¨omisch 2004)==
$f$ convex, $C$ convex, closed, $\xi$ has log-concave density
$\Psi(\xi)$ nonempty and bounded
$\exists x \in C : \quad \mathbb{P}(\xi \leq A x)>p$ (Slater point). Then, $\Psi$ is upper semicontinuous at $\xi$: $$ \Psi(\eta) \subseteq \Psi(\xi)+\varepsilon \mathbb{B} \quad \text{for} \quad \sup_{z \in \mathbb{R}^{s}}\left|F_{\xi}(z)-F_{\eta}(z)\right|<\delta $$ If in addition
$f$ convex-quadratic, $C$ polyhedron,
$\xi$ has a strongly log-concave distribution function, then $\Psi$ is locally Hausdorff-Hölder continuous at $\xi$:
$$ d_{\text{Haus}}(\Psi(\eta), \Psi(\xi)) \leq L \sqrt{\sup_{z \in \mathbb{R}^{s}}\left|F_{\xi}(z)-F_{\eta}(z)\right|} \quad \text{(locally around } \xi\text{)} $$
The pdf is rarely known precisely; in most cases it is an approximation.
Slight changes in the distribution may cause major changes to the optimum.
If the assumed distribution differs slightly from the actual distribution, the solution may not be stable.
In order to achieve Hölder or Lipschitz stability one has to impose further structural conditions, such as strong log-concavity, strict complementarity, and $C^{1,1}$-differentiability.
Applications
Engineering:
UGV, UAV: optimal navigation and reliable obstacle avoidance
System description: $$ \begin{aligned} \dot x & = f(x,u,\xi)\\ 0 & = g(x,u,\xi) \end{aligned} $$ $\xi$ may be:
random obstacles
sensor measurement errors
state generated errors
The probability density functions of these variables are estimated from previous data and formulated into chance constraints.
Based on these constraints, unmanned vehicles can travel through a specific region along the shortest path while avoiding random obstacles at a high confidence level.
Power system management:
Risk management
Conclusion
The chance-constraint method is a powerful way to solve optimization problems under uncertainty due to its robustness. It allows one to set a desired confidence level and to account for the trade-off between two or more objectives. However, it can be extremely difficult to solve. Probability density functions are often difficult to formulate, especially for nonlinear problems. Issues with convexity and stability mean that small deviations from the actual density function can cause major changes in the optimal solution. Further complications arise when the variables with and without uncertainty cannot be decoupled. Because of these complications, there is no single approach to solving chance-constrained problems, though the most common approach for simple problems is to transform the chance constraints into deterministic constraints by decoupling the variables with and without uncertainty and using probability density functions.
This approach has many applications in engineering and finance. Namely, it is widely used in energy management, risk management, and, more recently, determination of production levels, performance of industrial robots, and performance of unmanned vehicles. A lot of research in the area focuses on solving nonlinear dynamic problems with chance constraints.
We already know some common parametric probability distributions $p(x|\theta)$.
Now, we need to know how to learn the unknown parameters $\theta$ from given data $\mathcal{D}=\{x[1],\cdots,x[N]\}$, and then use these parameters to make predictions.
Learning
Methods to learning the unknown parameters $\theta$ from data
Maximum Likelihood Estimation (MLE)
Given data $\mathcal{D}=\{x[1],\cdots,x[N]\}$
Assume:
distribution family $\{p_\theta : \theta \in \Theta\}$, where $p_\theta(x)=p(x|\theta)$
$\mathcal{D}$ is sampled from $X_{1}, X_{2}, \ldots, X_{N} \sim p_{\theta^{*}}$ for some $\theta^{*} \in \Theta$
Random variables $X_{1}, X_{2}, \ldots, X_{N}$ are i.i.d
Goal: estimate $\theta^*$
Method: the estimate $\theta_{MLE}$ is a maximum likelihood estimate (MLE) for $\theta^*$ if $\theta_{MLE}=\underset{\theta}{\operatorname{argmax}} \prod_{i=1}^{N} p(x[i] \mid \theta)$
Solution: maximize the logarithm $\theta_{MLE}=\underset{\theta}{\operatorname{argmax}}\left[\log\left(\prod_{i=1}^{N} p(x[i] \mid \theta)\right)\right]$. Then solve by taking the derivative w.r.t. $\theta$ and setting it equal to 0.
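A minimal sketch of this recipe, assuming a Gaussian observation model (the data and parameters are illustrative):

```python
import numpy as np

# Setting the derivative of the Gaussian log-likelihood to zero gives the
# sample mean and the (biased) sample variance as the MLE.
rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_mle = data.mean()                      # argmax_mu of sum_i log N(x_i | mu, sigma^2)
var_mle = ((data - mu_mle) ** 2).mean()   # note: divides by N, not N-1
print(mu_mle, var_mle)
```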
Advantages:
Easy and fast to compute
Nice asymptotic properties:
Consistent: if data generated from $f(\theta^*)$, MLE converges to its true value, $\hat{\theta}_{MLE} \rightarrow \theta^*$ as $n \rightarrow \infty$
Efficient: there is no consistent estimator that has lower mean squared error than the MLE estimate
Functional Invariance: if $\hat{\theta}$ is the MLE of $\theta^*$, and $g(\theta^*)$is a transformation of $\theta^*$ then the MLE for $\alpha=g(\theta^*)$ is $\hat{\alpha}=g(\hat{\theta})$
Disadvantages:
MLE is a point estimate i.e., does not represent uncertainty over the estimate
MLE may overfit.
MLE does not incorporate prior information.
Asymptotic results are for the limit and assumes model is correct.
MLE may not exist or may not be unique
not know uncertainty of the parameter estimation
Bayesian Approach
To model uncertainty of the parameter estimation
Fitting: Instead of a point estimate $\hat{\theta}$, compute the posterior distribution over all possible parameter values using Bayes’ rule:
$\prod_{i=1}^{N} p(x[i] \mid \theta)\, p(\theta)=\prod_{i=1}^{N} \operatorname{Norm}_{x[i]}\left[\mu, \sigma^{2}\right] \underbrace{\operatorname{NormInvGam}_{\mu, \sigma^{2}}[\alpha, \beta, \gamma, \delta]}_{\text{conjugate prior}}$ if we use a Gaussian as our observation model, i.e., $p(x|\theta)$
Properties:
models uncertainty over parameters.
incorporating prior information
can derive quantities of interest, e.g., $p(x<10 \mid \mathcal{D})$
can perform model selection.
Drawbacks:
Need Prior
Computationally intractable
If the initial belief (prior) is not conjugate to the likelihood (e.g., a normal likelihood), the posterior may have no closed form.
Maximum a Posteriori Estimation (MAP)
Given: data $D=\{x[1], \cdots, x[N]\}$
Assume:
joint distribution $p(D,\theta)$
$\theta$ is a random variable
Goal: Choose “good” $\theta$
Method:
the estimate $\theta_{MAP}$ is a maximum a posteriori (MAP) estimate if
$$\begin{aligned} \theta_{MAP}&=\underset{\theta}{\operatorname{argmax}} [p(\theta|D)]\\ &=\underset{\theta}{\operatorname{argmax}}\left[\frac{p(D|\theta)p(\theta)}{p(D)} \right] \quad \text{Bayes' rule}\\ &=\underset{\theta}{\operatorname{argmax}}\left[\frac{\prod_{i=1}^N p(x[i]|\theta)p(\theta)}{p(D)} \right]\quad \text{i.i.d.}\\ &=\underset{\theta}{\operatorname{argmax}}\left[ \prod_{i=1}^N p(x[i]|\theta)p(\theta) \right] \quad p(D)\ \text{is removed since it is independent of}\ \theta \end{aligned}$$
Then set $\frac{\partial L}{\partial \theta_i}=0$ for each parameter $\theta_i$.
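A tiny worked sketch of MAP estimation, assuming a Beta-Bernoulli model (the coin-flip data and Beta prior are illustrative, not from the notes); it also shows how the prior shifts the estimate relative to the MLE:

```python
import numpy as np

# Maximizing prod_i p(x[i]|theta) * p(theta) with theta ~ Beta(a, b) and setting
# the derivative to zero gives the closed form below.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # coin flips
a, b = 2.0, 2.0                         # Beta prior hyperparameters

n_heads, n = x.sum(), len(x)
theta_mle = n_heads / n
theta_map = (n_heads + a - 1) / (n + a + b - 2)  # the prior acts like pseudo-counts
print(theta_mle, theta_map)
```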
Properties:
Fast and easy
incorporate prior information
Avoids overfitting (the prior acts as regularization)
As $n \rightarrow \infty$, MAP tends to look like MLE, but it does not have the same nice asymptotic properties.
Drawbacks:
Point estimation (like MLE)
does not capture uncertainty over the estimates
NEED to choose Prior.
Not functionally invariant:
if $\hat{\theta}$ is the MAP estimate of $\theta^*$, and $g(\theta^*)$ is a transformation of $\theta^*$, then the MAP estimate for $\alpha = g(\theta^*)$ is not necessarily $\hat{\alpha} = g(\hat{\theta})$
Prediction
MLE / MAP: evaluate a new data point under the parameter learned by MLE/MAP.
Bayesian: calculate a weighted sum of the predictions from all possible values of the parameters: $$ \begin{aligned} p\left(x^{*} \mid \mathcal{D}\right) &=\frac{p\left(x^{*}, \mathcal{D}\right)}{p(\mathcal{D})} \quad \text{(conditional probability)} \\ &=\frac{\int p\left(x^{*}, \mathcal{D}, \theta\right) d \theta}{p(\mathcal{D})} \quad \text{(marginal probability)} \\ &=\frac{\int p\left(x^{*}, \theta \mid \mathcal{D}\right) p(\mathcal{D}) d \theta}{p(\mathcal{D})} \quad \text{(conditional probability)} \\ &=\int p\left(x^{*} \mid \mathcal{D}, \theta\right) p(\theta \mid \mathcal{D}) d \theta \quad \text{(conditional probability)} \\ &=\int p\left(x^{*} \mid \theta\right) p(\theta \mid \mathcal{D}) d \theta \quad \text{(conditional independence)} \end{aligned} $$
Predictive Density
$$ p\left(x^{*} \mid \mathcal{D}\right)=\int \underbrace{p\left(x^{*} \mid \theta\right)}_{\text{prediction for each possible}\ \theta}\ \underbrace{p(\theta \mid \mathcal{D})}_{\text{weights}}\ d \theta $$
Make a prediction that is an infinite weighted sum (an integral) of the predictions for each parameter value, where the weights are probabilities.
MLE and MAP estimates can be viewed as probability distributions with zero probability everywhere except at the estimate.
As training data decrease, the Bayesian approach becomes less certain, whereas MAP remains (wrongly) overconfident.
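A small numeric sketch of the predictive density as a weighted average over parameters, assuming a Beta-Bernoulli model (the grid approximation and data are illustrative):

```python
import numpy as np

# p(x*=1 | D) = int theta * p(theta|D) dtheta, approximated on a grid to make
# the "infinite weighted sum" explicit.
x = np.array([1, 0, 1, 1, 1])
a, b = 1.0, 1.0                                  # uniform Beta prior
heads, tails = x.sum(), len(x) - x.sum()

grid = np.linspace(1e-6, 1 - 1e-6, 10_001)
post = grid ** (a + heads - 1) * (1 - grid) ** (b + tails - 1)
post /= np.trapz(post, grid)                     # normalize the posterior p(theta|D)

p_next_head = np.trapz(grid * post, grid)        # weighted sum of per-theta predictions
print(p_next_head)                               # analytic answer: (heads+a)/(len(x)+a+b)
```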
Exponential Family
exponential family:
a set of probability distributions $\{p_\theta: \theta \in \Theta\}$ with the form $$ p_{\theta}(x)=\frac{h(x) \exp \left[-\eta(\theta)^{\top} s(x)\right]}{Z(\theta)} $$ where $h(x)$ is the base measure, $\eta(\theta)$ are the natural parameters, $s(x)$ are the sufficient statistics, and $Z(\theta)$ is the normaliser.
An exponential family is in its natural (canonical) form if it is parameterized by its natural parameters $\eta$: $$ p_{\eta}(x)=p(x \mid \eta)=\frac{h(x) \exp \left[-\eta^{\top} s(x)\right]}{Z(\eta)} $$ Normaliser: $Z(\eta)$ $$ Z(\eta)=\int h(x) \exp \left[-\eta^{\top} s(x)\right] d x $$ Properties:
has conjugate prior.
fixed number of sufficient statistics that summarize i.i.d. data: a large amount of i.i.d. data can be reduced to a few estimated values, namely the sufficient statistics $s(x)$.
Posterior predictive distribution always has closed form solution
For any ExpFam: $$ \mathbb{E}[s(x)]=\nabla \log \mathrm{Z}(\eta) $$
By differentiating $Z(\eta)$, one easily obtains the mean, variance, and other properties of the sufficient statistics $s(x)$.
Important Properties:
Many distributions are Exponential Family:
PMF: Bernoulli, Binomial, Categorical/Multinoulli, Poisson, Multinomial, Negative Binomial, …
PDF: Normal, Gamma & Inverse Gamma, Wishart & Inverse Wishart, Beta, Dirichlet, lognormal, Exponential, …
Bayesian Networks
Conditional Independence
Fitting probability models (learning) and predictive density (inference)
But only for the case of a single random variable, i.e., $p(x|\theta)$.
How about joint probability with $N$ random variables?
Problem:
Assume $N$ discrete random variables $x_1, \cdots, x_N$, where $x_i\in\{1,\cdots,K\}$. $\longrightarrow$ need $O(K^N)$ parameters to represent the joint distribution $p(x_1, \cdots, x_N)$ $\longrightarrow$ hard to infer; a huge amount of data is needed to learn all parameters
If all random variables are independent $\longrightarrow$ $O(NK)$ parameters with $p(x_1,\cdots, x_N|\theta) = \prod_{i=1}^N p(x_i|\theta_i)$ $\longrightarrow$ inference is possible with less data
### Definition:
Two random variables $X_A$ and $X_C$ are conditionally independent given $X_B$, $X_{A} \perp X_{C} | X_{B}$ iff $p\left(x_{A}, x_{C} | x_{B}\right)=p\left(x_{A} | x_{B}\right) p\left(x_{C} | x_{B}\right)$ OR $p\left(x_{A} | x_{B}, x_{C}\right)=p\left(x_{A} | x_{B}\right), \quad \forall X_{B} : p\left(x_{B}\right)>0$, which means learning $X_C$ does not change prediction of $X_A$ once we know the value of $X_B$.
### Meanings:
parameter reduction
let $m_i$ denote the number of parents of node $X_i$, and let each node take on $K$ values.
the conditional probability of $X_i$ uses $K^{m_i+1}$ parameters
Summing over all nodes, the parameters are reduced from $O(K^N)$ to $O(K^{m+1})$, with $m \ll N$
Learning with conditional independence: MLE
represent the joint probability distribution
Taking the logarithm of $p(x|\theta)$ converts the product into sums
find the maximum log-likelihood for each parameter $\theta_i$; this can be done separately.
$\underset{\theta_1}{\operatorname{argmax}}\ \log p(x|\theta)$, $\underset{\theta_2}{\operatorname{argmax}}\ \log p(x|\theta)$, $\cdots$ [other random variables which are independent of $\theta_1$ can be removed.]
Learning with conditional independence: MAP
Similar to MLE: $$ \begin{aligned} \underset{\theta_{i}}{\operatorname{argmax}} \log p(\theta \mid x) &=\underset{\theta_{i}}{\operatorname{argmax}} \log p(x \mid \theta) p(\theta) \\ &=\underset{\theta_{i}}{\operatorname{argmax}}\left\{\log p\left(x_{i} \mid x_{\pi_{i}}, \theta_{i}\right)+\log p\left(\theta_{i}\right)\right\} \end{aligned} $$ where $\theta=\left(\theta_{1}, \ldots, \theta_{M}\right)$
Bayesian Networks
Bayesian Network is a DAG $\mathcal{G}$ (no cycle)
nodes are variables $X_i$; the set of parent nodes $X_{\pi_i}$ contains all variables that are parents of $X_i$
Topological ordering: if $X_i \rightarrow X_j$, then $i<j$ [not unique]
Path: a path follows the arrows, considering direction. V.S. Trail: a trail follows the edges without considering direction.
Ancestors: parents, grand-parents , etc. V.S. Descendants: children, grand-children, etc.
Local Markov assumption :
Each random variable $X_i$ is independent of its non-descendants $X_{nonDesc(X_i)}$ given its parents $X_{\pi_i}$
Precise definition of “blocking” has to be done through the “three canonical 3-node graphs”, and “d-separation”.
Three canonical 3-node graphs:
Head-Tail
if no variables are observed, A and B are not independent, $A \not\perp B \mid \emptyset$: $$ p(a, b)=p(a) \sum_{c} p(c \mid a) p(b \mid c)=p(a) p(b \mid a) $$
if C is observed, using Bayes' rule we obtain $A \perp B \mid C$: $$ \begin{aligned} p(a, b \mid c) &=\frac{p(a, b, c)}{p(c)} \\ &=\frac{p(a) p(c \mid a) p(b \mid c)}{p(c)} \\ &=p(a \mid c) p(b \mid c) \end{aligned} \quad \text{(Bayes' rule)} $$
Intuition: past A is independent of future B given present C, $\rightarrow$ simple Markov Chain
Tail-Tail
If no variables are observed, NOT independent: $A \not\perp B \mid \emptyset$
If C observed, obtain $A \perp B | C$ , because
$$ \begin{aligned} p(a, b \mid c) &=\frac{p(a, b, c)}{p(c)} \\ &=p(a \mid c) p(b \mid c) \end{aligned} $$
Head-Head
if no variables are observed, we obtain $A \perp B \mid \emptyset$ because $$ p(a, b)=p(a) p(b) $$
if C is observed, we obtain $A \not\perp B \mid C$, because $$ \begin{aligned} p(a, b \mid c) &=\frac{p(a, b, c)}{p(c)} \\ &=\frac{p(a) p(b) p(c \mid a, b)}{p(c)} \end{aligned} $$ which in general does not factor into $p(a \mid c)p(b \mid c)$.
“V-structure”.
if C unobserved, “blocks”. If C observed, “unblocks”.
Observation of any descendent node of C “unblocks” the path from A to B.
$A \perp B |C$ if all trails from nodes in A are “Blocked” from nodes in B when all C are observed.
A is d-separated form B by C
Key: Independence:
The arrows on the trail meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
The arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set C.
Bayes Ball
it's a "reachability" algorithm:
shade the nodes in set $C$ (given nodes)
Place a ball at each of the nodes in set $A$
Let the balls “bounce around” the graph according to the d-separation rules:
if none of the balls reach B, then $A \perp B|C$
Else $A \not\perp B \mid C$
can be implemented as a breadth-first search
procedure of algorithm:
given Graph $G$ and sets of nodes $X, Y$, and $Z$
Output True if $X\perp Y | Z$, False otherwise.
2 Phases:
Phase 1: “Find all the unblocked v-structures”
traverse the graph from the leaves to the roots, marking all nodes that are in $Z$ or have descendants in $Z$.
Phase 2: “Traverse all the trails starting from $X$”
apply breadth-first-search, stopping a specific traversal when we hit a blocked node.
If $Y$ is reached during this traversal, return False. Else, return True.
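A compact sketch of this two-phase reachability test (the graph encoding and function name are my own choices; the DAG is given as a child-adjacency dict):

```python
from collections import deque

def d_separated(graph, X, Y, Z):
    """Return True if X is d-separated from Y given Z in the DAG `graph`."""
    parents = {n: set() for n in graph}
    for n, children in graph.items():
        for c in children:
            parents[c].add(n)

    # Phase 1: mark nodes that are in Z or have a descendant in Z (= Z and its ancestors).
    marked, frontier = set(Z), deque(Z)
    while frontier:
        n = frontier.popleft()
        for p in parents[n]:
            if p not in marked:
                marked.add(p)
                frontier.append(p)

    # Phase 2: breadth-first traversal over (node, direction) pairs starting from X.
    queue = deque((x, "up") for x in X)
    visited = set()
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in Z and node in Y:
            return False                       # an active trail reaches Y
        if direction == "up" and node not in Z:
            queue.extend((p, "up") for p in parents[node])
            queue.extend((c, "down") for c in graph[node])
        elif direction == "down":
            if node not in Z:                  # head-to-tail: keep going down
                queue.extend((c, "down") for c in graph[node])
            if node in marked:                 # v-structure unblocked by Z or a descendant
                queue.extend((p, "up") for p in parents[node])
    return True

chain = {"A": ["C"], "C": ["B"], "B": []}
vee = {"A": ["D"], "B": ["D"], "D": []}
assert d_separated(chain, {"A"}, {"B"}, {"C"})     # head-tail blocked by observing C
assert d_separated(vee, {"A"}, {"B"}, set())       # v-structure blocked when D is unobserved
assert not d_separated(vee, {"A"}, {"B"}, {"D"})   # observing D unblocks the v-structure
```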
If the prior is a Laplacian distribution: $p(\mathbf{w}\mid 0,b)=\frac{1}{2b}\exp \left(-\frac{|w|}{b}\right)$
$\log p(\mathbf{w}\mid 0,b)\propto -\frac{|w|}{b}$; then we have $\lambda = -\frac{2\sigma_n^2}{b}$, which gives the $l_1$-norm penalty.
Derivative of the $l_0$-norm: it is zero, but only in places where it isn't interesting; where it is interesting it is not differentiable (it has jump discontinuities).
Derivative of the $l_1$-norm: this norm is not differentiable with respect to a coordinate where that coordinate is zero. Elsewhere, the partial derivatives are just constants, $\pm 1$ depending on the quadrant.
Derivative of the $l_2$-norm: usually people use the $l_2$-norm squared so that it is differentiable even at zero. The gradient of $\|x\|^2$ is $2x$, but without the square it is $\frac{x}{\|x\|}$ (i.e., it just points away from zero), which is not differentiable at zero.
Predictive DGM Model
Naive Bayes
Model:
Class $c \in{1, \ldots, K}$, and input $\mathbf{x}=\left[x_{1}, x_{2}, \ldots, x_{D}\right]^{\top}$
Naive Bayes assumes that $p(\mathbf{x} \mid c)=\prod_{j} p\left(x_{j} \mid c\right)$. Naive assumption: all features are conditionally independent given the class.
Filled dots are deterministic parameters. If we have priors, we can make parameters random variables (open circles)
Even though the naive Bayes assumption is very strong, why does it still work well?
Key idea: it works well when the local feature dependencies between the two classes "cancel out".
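A minimal Bernoulli naive Bayes sketch, assuming binary features and Laplace smoothing (all names and the toy data are illustrative):

```python
import numpy as np

# Fit class priors and per-feature conditionals by counting, then classify with
# log p(c) + sum_j log p(x_j | c).
def fit(X, y, n_classes, alpha=1.0):
    priors = np.array([(y == c).mean() for c in range(n_classes)])
    cond = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                     for c in range(n_classes)])          # p(x_j = 1 | c), smoothed
    return priors, cond

def predict(X, priors, cond):
    log_lik = X @ np.log(cond).T + (1 - X) @ np.log(1 - cond).T  # sum_j log p(x_j | c)
    return np.argmax(np.log(priors) + log_lik, axis=1)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
priors, cond = fit(X, y, n_classes=2)
print(predict(X, priors, cond))
```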
## Properties of Bayesian Networks
Independence-Maps (I-maps)
Theorem:
Let $P$ be a joint distribution over $\mathcal{X}$ and $G$ be a Bayesian network structure over $\mathcal{X}$. If $P$ factorizes according to $G$, then the local independence assertions satisfy $\mathcal{J}_l(G)\subseteq \mathcal{J}(P)$.
The local Markov independencies $\mathcal{J}_l(G)$ are the set of all basic conditional independence assertions of the form $\left\{X_{i} \perp\left(X_{\text{nondesc}\left(X_{i}\right)} \backslash X_{\pi_{i}}\right) \mid X_{\pi_{i}}\right\}$.
Global Markov independence:
Definition: the set of all independencies that correspond to d-separation in graph $G$ is the set of global Markov independencies: $$ \mathcal{J}(G)=\left\{(X \perp Y \mid Z) : \operatorname{dsep}_{\mathrm{G}}(X ; Y \mid Z)\right\} $$
Soundness:
Definition: if a distribution $P$ factorizes according to $G$, then $\mathcal{J}_l(G)\subseteq \mathcal{J(P)}$.
i.e., if two nodes are d-separated given $Z$, they are conditionally independent given $Z$.
Completeness:
Definition: $P$ is faithful to $G$ if for any conditional independence $(X \perp Y \mid Z) \in \mathcal{J}(P)$, we have $\operatorname{dsep}_{G}(X ; Y \mid Z)$.
i.e., any conditional independence in $P$ is reflected as a d-separation in $G$.
Definition: weak completeness
Definition: almost completeness
How Powerful:
Can we find a Bayesian network $G$ that represents all conditional independencies of a given $P$?
Minimal I-map: “no redundant edges”
perfect map:
No
Undirected Graphical Models (Markov Random Fields)
## Undirected Graphical Models
A UGM (MRF, or Markov network) is a graph $\mathcal{G}(\mathcal{V},\mathcal{E})$ with a set of nodes and a set of undirected edges.
Parameterization: via factors.
Factor $\varphi(\mathcal{C})$: a function mapping a set of variables to non-negative numbers.
Global Markov property: $A \perp B \mid C$ if no trails connect $A$ and $B$ after removing all nodes in $C$.
Local Markov Property
Markov blanket $\operatorname{mb}(X_s)$: the set of nodes that renders a node $X_s$ conditionally independent of all the other nodes: $$ X_{s} \perp \overbrace{\mathcal{V} \backslash\left\{\mathrm{mb}\left(X_{s}\right), X_{s}\right\}}^{\text{all other nodes in}\ \mathcal{G}} \mid \mathrm{mb}\left(X_{s}\right) $$
Markov blanket in UGM is the set of immediate neighbors
A node is conditionally independent of all other nodes except itself and its neighbors (due to the global Markov property).
Markov Blankets for a node $X_s$ in a DGM is the set of node’s parents, children, and co-parents (other parents of children).
Pairwise Markov Property
two nodes $X_s$ and $X_t$ are conditionally independent given the rest if there is no direct edge between them: $$ X_{s} \perp X_{t} \mid \mathcal{V} \backslash\left\{X_{s}, X_{t}\right\}, \text{ where } \mathcal{E}_{st}=\emptyset $$
Relationship among the three properties:
Interrelated.
Global Markov $\Longrightarrow$ local Markov $\Longrightarrow$ Pairwise Markov $\Longrightarrow$ Global Markov
For a positive distribution $p(\mathbf{x})>0$, the three Markov properties are equivalent.
Parameterization of MRFs
Aim: obtain a local parameterization of UGMs.
For DGM:
Local conditional probabilities: $p(x_i|x_{\pi_i})$
Joint probability is product of local conditional probabilities: $p\left(x_{1}, \ldots, x_{N}\right)=\prod_{i=1}^{N} p\left(x_{i} | x_{\pi_{i}}\right)$
But for a UGM it is difficult to obtain conditional probabilities since there is no topological ordering, so conditional probabilities are discarded.
Represent the joint as a product of local functions.
Normalize
Factors NOT represent marginal/conditional distributions
Gibbs distribution
Definition: A distribution $P$ is a Gibbs distribution parameterized by a set of factors $\{\varphi_1(\mathcal{C}_1), \varphi_2(\mathcal{C}_2),\cdots, \varphi_M(\mathcal{C}_M)\}$, where $$ p\left(x_{1}, x_{2}, \ldots, x_{N}\right)=\frac{1}{Z} \prod_{j=1}^{M} \varphi_{j}\left(\mathcal{C}_{j}\right) $$
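A tiny sketch of a Gibbs distribution, assuming three binary variables and two hand-picked pairwise factors (the values are illustrative):

```python
import itertools
import numpy as np

# Factors are non-negative tables over cliques; the joint is their product
# divided by the partition function Z.
phi_ab = np.array([[3.0, 1.0], [1.0, 3.0]])   # factor over (x1, x2): prefers agreement
phi_bc = np.array([[2.0, 1.0], [1.0, 2.0]])   # factor over (x2, x3)

unnorm = {s: phi_ab[s[0], s[1]] * phi_bc[s[1], s[2]]
          for s in itertools.product([0, 1], repeat=3)}
Z = sum(unnorm.values())                      # partition function
p = {s: v / Z for s, v in unnorm.items()}     # normalized joint p(x1, x2, x3)
print(p[(0, 0, 0)], sum(p.values()))          # the probabilities sum to 1
```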
Factor:
Two nodes $X_i$ and $X_j$ that are not directly linked in a UGM are conditionally independent given all other nodes.
$\rightarrow$ Factorization of joint probability that places $X_i$ and $X_j$ in different factors.
All nodes $X_C$ in a maximal clique $C$ of the UGM form a factor (local function) $\varphi(X_C)$.
Clique: a fully connected subset of nodes.
Maximal Clique $C$: cliques that cannot be extended to include additional nodes
Hammersley-Clifford: a positive distribution $p(y)>0$ satisfies the CI properties of an undirected graph $\mathcal{H}$ iff $p$ can be represented as a product of factors, one per maximal clique: $$ p(\mathbf{y} \mid \boldsymbol{\theta})=\frac{1}{Z(\boldsymbol{\theta})} \prod_{c \in \mathcal{C}} \psi_{c}\left(\mathbf{y}_{c} \mid \boldsymbol{\theta}_{c}\right) $$
where $\mathcal{C}$ is the set of all maximal cliques, $\psi_C(\cdot)$ is the factor or potential function of clique $c$, $\boldsymbol{\theta}$ collects the parameters $\boldsymbol{\theta}_c$ of the factors for $c\in\mathcal{C}$, and $Z(\boldsymbol{\theta})$ is the partition function $Z(\boldsymbol{\theta}) \triangleq \sum_{\mathbf{y}} \prod\limits_{c \in \mathcal{C}} \psi_{c}\left(\mathbf{y}_{c} \mid \boldsymbol{\theta}_{c}\right)$.
UGM NOT specify a unique factorization.
Hammersley-Clifford is just one type of factorization.
Pairwise MRF:
parameterization by the edges of the graph, rather than maximal cliques (each edge is a factor; maximal cliques are not considered).
SIMPLE but not general.
Properties:
Independent set; Independence map (I-map);
Soundness
if $P$ is a Gibbs distribution over $H$, then $H$ is an I-map for $P$.
Hammersley-Clifford states that if $H$ is an I-map for $P$, then $P$ is a Gibbs distribution over $H$ (for positive distribution)
Weak completeness
if $X$ and $Y$ are not separated in $H$, then there is some distribution $P$ that factorizes over $H$ in which $X$ and $Y$ are dependent.
How powerful: can a UGM represent all conditional independencies of a given $P$?
Perfect map: $\mathcal{J}(G)=\mathcal{J}(P)$
UGM are NOT “richer” or “more powerful” than DGM.
Representing the potential (local) functions
The log potentials are linear functions of the parameters: $\log \psi_{c}\left(\mathbf{y}_{c}\right)=\phi_{c}\left(\mathbf{y}_{c}\right)^{T} \boldsymbol{\theta}_{c}$
$\phi_c(\mathbf{y}_c)$ is a feature vector derived from the values of the variables $\mathbf{y}_c$.
Then, the log probability is $$ \log p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{c} \phi_{c}\left(\mathbf{y}_{c}\right)^{T} \boldsymbol{\theta}_{c}-\log Z(\boldsymbol{\theta}) $$ also called the "maximum entropy" or "log-linear" model.
every MRF is an [exponential family](#Exponential Family).
Example of MRF
Model:
Ising and Potts models (depth map from stereo images (multi-class) or binary image denoising)
Clamped term: $y$ is fixed to its observed values; Unclamped/contrastive term: $y$ is a free variable.
Unclamped term requires inference in the model, once per gradient step, and this makes UGM learning much slower than DGM.
The gradient of the log-likelihood can be rewritten as:
$$ \frac{\partial l}{\partial \theta_{c}}=E_{p_{e m p}}\left[\phi_{c}(y)\right]-E_{p(y | \theta)}\left[\phi_{c}(y)\right] $$
$E_{p_{emp}}\left[\phi_{c}(y)\right]=\frac{1}{N} \sum_{n=1}^{N} \phi_{c}\left(y_{n}\right)$: expected feature vector according to the empirical distribution.
$E_{p(y | \theta)}\left[\phi_{c}(y)\right]$ : Expected feature vector according to model’s distribution.
At the optimum the gradient is 0, i.e., $E_{p(y \mid \theta)}\left[\phi_{c}(y)\right]=\sum_{y} \phi_{c}(y) p(y \mid \theta)$ must match the empirical expectation. But this equation cannot be solved for the unknown $\theta$ in closed form.
Solution: Gradient-based optimizers.
But computing $E_{p(y \mid \theta)}\left[\phi_{c}(y)\right]$ requires a sum over all states of $y$, which is intractable.
Solution: combine approximate inference with gradient-based learning. i.e. Stochastic Maximum Likelihood
Stochastic Maximum Likelihood:
iteratively update the parameter $\theta_{k+1}$ at step $k$ using the parameter and gradient from the previous step: $$ \theta_{k+1} \leftarrow \theta_{k}-\eta g_{k} $$ where $\eta$ is the step size or learning rate and $g_k\approx \frac{\partial l}{\partial \theta_C}$ is the gradient, which can be approximated with Markov chain Monte Carlo (MCMC) sampling.
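A hedged sketch of one such update: the log-likelihood gradient is the empirical expected feature vector minus the model's expected feature vector, with the latter approximated from samples standing in for MCMC draws (the feature map and data are toy assumptions; written as ascent on the log-likelihood, which matches the update above when $g_k$ is the negative log-likelihood gradient):

```python
import numpy as np

def sml_step(theta, phi, data, model_samples, lr=0.1):
    # grad = E_emp[phi(y)] - E_model[phi(y)], the latter estimated from samples.
    grad = (np.mean([phi(y) for y in data], axis=0)
            - np.mean([phi(y) for y in model_samples], axis=0))
    return theta + lr * grad      # gradient ascent on the log-likelihood

phi = lambda y: np.array([y, y ** 2], dtype=float)   # toy feature map
theta = np.zeros(2)
theta = sml_step(theta, phi, data=[1, 1, 0, 1], model_samples=[0, 1, 0, 0])
print(theta)
```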
Advantages (compared to generative models):
No need to waste resources, i.e. modelling things that we can always observe.
Focus our attention on modeling what we care about, i.e. the distribution of labels given the data.
We can make the potentials (or factors) of the model be data-dependent.
e.g., in natural language processing problems, we can make the latent labels depend on global properties of the sentence, such as which language it is written in.
Disadvantages:
require labeled training data
learning is slower.
Parameter learning:
Consider a CRF in log-linear form: $$ p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})=\frac{1}{Z(\mathbf{x}, \mathbf{w})} \prod_{c} \exp \left(\mathbf{w}_{c}^{T} \phi_{c}\left(\mathbf{x}, \mathbf{y}_{c}\right)\right) $$ where $\phi_{c}\left(\mathbf{x}, \mathbf{y}_{c}\right)$ is a feature vector derived from the global inputs $\mathbf{x}$ and the local set of labels $\mathbf{y}_c$.
Need inference for every single training case inside each gradient step, which is $O(N)$ times slower than MRF.
Also use MAP to estimate the unknown parameters, $\underset{w}{\operatorname{argmax}}\left\{\sum_i \log p\left(y_{i} \mid x_{i}, w\right)+\log p(w)\right\}$, by choosing a prior.
Probabilistic inference
calculate $p(X_F|X_E)$ for arbitrary subsets $E$ and $F$
Parametrise the policy: $\pi_\theta(s,a)=\mathbb P \left[a \mid s,\theta\right]$. $$ \nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \underbrace{\pi_\theta(s,a)}_{\text{policy}}\ \underbrace{\nabla_\theta \log \pi_\theta(s,a)}_{\text{score function}} $$
softmax policy: [discrete action space]
weighting actions using linear combination of features $\phi(s,a)^\top\theta$
probability of action is proportional to exponentiated weight $\pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta}$
The score function is $\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb E_{\pi_\theta} [\phi(s,\cdot)]$ (see the sketch after the Gaussian policy below)
Gaussian policy: [continuous action space]
mean is linear combination of state feature $\mu(s) = \phi(s)^\top \theta$
variance is fixed $\sigma^2$ or parameterised
policy is Gaussian $a\sim\mathcal N(\mu(s),\sigma^2)$
the score function is $\nabla_\theta \log \pi_\theta(s,a) = \frac{(a-\mu(s))\phi(s)}{\sigma^2}$
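A small sketch of the softmax-policy score function referenced above, $\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb E_{\pi_\theta}[\phi(s,\cdot)]$ (the feature map and numbers are toy assumptions):

```python
import numpy as np

def softmax_policy_score(phi, theta, s, actions, a):
    feats = np.array([phi(s, b) for b in actions])   # |A| x d feature matrix
    logits = feats @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # pi_theta(s, .)
    return phi(s, a) - probs @ feats                 # phi(s,a) - E_pi[phi(s,.)]

phi = lambda s, a: np.array([s, float(a == 1), s * a])   # toy feature map
theta = np.array([0.1, -0.2, 0.3])
print(softmax_policy_score(phi, theta, s=0.5, actions=[0, 1, 2], a=1))
```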
Disadvantages:
require full episodes
high gradient variance: $\nabla L \simeq \mathbb E [Q(s,a) \nabla \log \pi(a|s)]$, which is proportional to the discounted reward from the given state.
–> subtract a baseline from $Q$, which can be:
some constant value (mean of the discounted rewards)
moving average of discounted rewards
the value of the state $V(s)$
Exploration: entropy characterizes the uncertainty of the policy.
To avoid local optima, subtract the entropy from the loss function so that the policy is penalized for being too certain.
Correlation between samples.
Actor-critic and Variants
Actor-critic
reduce the variance of the gradient (since infinite interaction with the environment is impossible, the empirical expectation differs from the true one, which introduces large variance)
Parallel actor-learners have a stabilizing effect on training allowing many methods (DQN, n-step Sarsa, AC) to successfully train NN.
Unlike memory replay, A3C allows multiple agents to play independently on separate environments; each environment runs on the CPU in its own thread.
There is a global optimizer and a global actor-critic. Each agent interacts with its environment in a separate thread; when agents reach a terminal state –> gradient descent on the global optimizer.
Pseudocode:
Implementation details:
Use torch's multiprocessing module to spawn multiple trainer processes; each process has a player that interacts with the environment and collects samples, periodically performing gradient updates with the data in its buffer. To make asynchronous updates effective, a shared model (global_actor_critic) is instantiated in the main process, and the asynchronously collected gradients are applied through a global optimizer.
```python
global_actor_critic = Agent(input_dims=input_dims, n_actions=n_actions, gamma=GAMMA)  # global network
global_actor_critic.actor_critic.share_memory()  # share the global parameters in multiprocessing
optimizer = SharedAdam(global_actor_critic.actor_critic.parameters(), lr=lr, betas=(0.92, 0.999))  # global optimizer
```
Use the state distribution of the old policy $\pi$ in place of the state distribution of the new policy $\tilde\pi$; the surrogate loss function is $$ L_\pi(\tilde{\pi}) = \eta(\pi)+\sum\limits_{s} \rho_{\pi}(s) \sum\limits_{a} \tilde{\pi}(a \mid s) A^{\pi}(s, a) $$ At this point the action $a$ is generated by the new policy $\tilde\pi$, but the parameters $\theta$ of the new policy $\tilde\pi$ are unknown, so this cannot be applied directly.
Use importance sampling to handle the action distribution: $$ \sum_{a} \tilde{\pi}_{\theta}\left(a \mid s_{n}\right) A_{\theta_{\text{old}}}\left(s_{n}, a\right)=E_{a \sim q}\left[\frac{\tilde{\pi}_{\theta}\left(a \mid s_{n}\right)}{q\left(a \mid s_{n}\right)} A_{\theta_{\text{old}}}\left(s_{n}, a\right)\right] $$ so the surrogate loss function can be estimated from samples drawn from $q$. Comparing this surrogate loss function $L_\pi(\tilde\pi)$ with $\eta(\tilde\pi)$: $$ \begin{aligned} L_{\pi_{\theta_{old}}}\left(\pi_{\theta_{old}}\right) &=\eta\left(\pi_{\theta_{old}}\right) \\ \left.\nabla_{\theta} L_{\pi_{\theta_{old}}}\left(\pi_{\theta}\right)\right|_{\theta=\theta_{old}} &=\left.\nabla_{\theta} \eta\left(\pi_{\theta}\right)\right|_{\theta=\theta_{old}} \end{aligned} $$
Use the inequality $\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi})$, where $C=\frac{4 \epsilon \gamma}{(1-\gamma)^{2}}$.
With the penalty factor $C$, the step size will be small. –> Add a constraint on the KL divergence instead (trust region constraint): $$ \begin{array}{l} \underset{\theta}{\operatorname{maximize}} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} A_{\theta_{\text{old}}}(s, a)\right] \\ \text{subject to} \quad D^{\max}_{\mathrm{KL}}\left(\theta_{\text{old}}, \theta\right) \leq \delta \end{array} $$ In the constraint, the average KL divergence replaces the maximum KL divergence, and the problem simplifies to $$ \begin{array}{l} \underset{\theta}{\operatorname{maximize}} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\text{old}}}(s, a)\right] \\ \text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta \end{array} $$ The objective function (surrogate advantage) is a measure of how policy $\pi_\theta$ performs relative to the old policy $\pi_{\theta_{old}}$ using data from the old policy.
Next, sample data, form the sample averages, and solve the resulting optimization problem; this completes the theoretical TRPO algorithm.
Solving the optimization problem:
The objective and constraint are both zero when $\theta=\theta_{old}$. Furthermore, the gradient of the constraint with respect to $\theta$ is zero when $\theta=\theta_{old}$. TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around $\theta_{old}$: $$ \begin{array}{c} \mathcal{L}\left(\theta_{k}, \theta\right) \approx g^{T}\left(\theta-\theta_{k}\right) \\ \bar{D}_{KL}\left(\theta \| \theta_{k}\right) \approx \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right) \end{array} $$ resulting in an approximate optimization problem, $$ \begin{array}{r} \theta_{k+1}=\arg \max_{\theta}\ g^{T}\left(\theta-\theta_{k}\right) \\ \text{s.t. } \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right) \leq \delta \end{array} $$ This approximate problem can be analytically solved by the methods of Lagrangian duality, yielding the solution: $$ \theta_{k+1}=\theta_{k}+\sqrt{\frac{2 \delta}{g^{T} H^{-1} g}} H^{-1} g $$ A problem is that, due to the approximation errors introduced by the Taylor expansion, this may not satisfy the KL constraint, or actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search, $$ \theta_{k+1}=\theta_{k}+\alpha^{j} \sqrt{\frac{2 \delta}{g^{T} H^{-1} g}} H^{-1} g $$ where $\alpha\in (0,1)$ is the backtracking coefficient, and $j$ is the smallest nonnegative integer such that $\pi_\theta$ satisfies the KL constraint and produces a positive surrogate advantage.
Lastly: computing and storing the matrix inverse, $H^{-1}$, is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the conjugate gradient algorithm to solve $Hx=g$ for $x=H^{-1}g$, requiring only a function which can compute the matrix-vector product $Hx$ instead of computing and storing the whole matrix $H$ directly. This is not too hard to do: we set up a symbolic operation to calculate $$ H x=\nabla_{\theta}\left(\left(\nabla_{\theta} \bar{D}_{KL}\left(\theta \| \theta_{k}\right)\right)^{T} x\right) $$ which gives us the correct output without computing the whole matrix.
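A minimal torch sketch of that Hessian-vector product trick, using a stand-in quadratic for the average KL (so the true $H$ is the identity and the result can be checked by eye):

```python
import torch

# Hx = grad( (grad kl)^T x ), so H itself is never materialized.
theta = torch.randn(4, requires_grad=True)
theta_old = theta.detach().clone()

def mean_kl(theta):
    # stand-in for the average KL divergence between pi_theta_old and pi_theta
    return 0.5 * torch.sum((theta - theta_old) ** 2)

x = torch.randn(4)
grad_kl = torch.autograd.grad(mean_kl(theta), theta, create_graph=True)[0]
Hx = torch.autograd.grad(grad_kl @ x, theta)[0]   # Hessian-vector product
print(Hx)   # for this toy KL, H is the identity, so Hx equals x
```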
Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.
PPO
Fundamental Knowledge:
TRPO: $$ \begin{array}{ll} \underset{\theta}{\operatorname{maximize}} & \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta_{\text{old}}}\left(a_{t} \mid s_{t}\right)} \hat{A}_{t}\right] \\ \text{subject to} & \hat{\mathbb{E}}_{t}\left[\operatorname{KL}\left[\pi_{\theta_{\text{old}}}\left(\cdot \mid s_{t}\right), \pi_{\theta}\left(\cdot \mid s_{t}\right)\right]\right] \leq \delta \end{array} $$ This problem can efficiently be approximately solved using the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint. The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem $$ \underset{\theta}{\operatorname{maximize}}\ \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} \mid s_{t}\right)}{\pi_{\theta_{\text{old}}}\left(a_{t} \mid s_{t}\right)} \hat{A}_{t}-\beta \operatorname{KL}\left[\pi_{\theta_{\text{old}}}\left(\cdot \mid s_{t}\right), \pi_{\theta}\left(\cdot \mid s_{t}\right)\right]\right] $$
for some coefficient $\beta$.
a certain surrogate objective (which computes the max KL over states instead of the mean) forms a lower bound (i.e., a pessimistic bound) on the performance of the policy $\pi$.
TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of $\beta$ that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning.
Actor-critic:
sensitive to perturbations (small change in parameters of network –> big change of policy space)
In order to address this issue, PPO:
limits update to policy network, and
base the update on the ratio of new policy to old. (make the ratio be confined into a range –> not take big step in parameter space)
Need to account for goodness of state (advantage)
clip the loss function and take lower bound with minimum function
Track a fixed length trajectory of memories (instead of many transitions)
Use multiple network updates per data sample. Use minibatch stochastic gradient ascent
can use multiple parallel CPU
Important Components:
Update Actor is different $$ L^{CPI}(\theta) = \hat{\mathbb E} \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t) }\hat A_t \right] = \hat{\mathbb E} \left[r_t(\theta) \hat A_t \right] $$
based on ratio of new policy to old policy (can use log)
consider the advantage
Maximizing $L^{CPI}(\theta)$ alone would lead to excessively large policy updates –> modify the objective
Total loss is the sum of the clipped actor loss and the critic loss: $$ L_t^{CLIP+VF+S} = \hat{\mathbb E}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF} + c_2 S[\pi_\theta](s_t) \right] $$
coefficient $c_1$ of the critic
$S$ term is the entropy term [which is required if the actor and critic are coupled].
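A hedged sketch of the clipped actor part of this loss (tensor names and numbers are illustrative; the value and entropy terms are omitted):

```python
import torch

def ppo_actor_loss(new_logp, old_logp, advantage, eps=0.2):
    # ratio = pi_theta / pi_theta_old; take the elementwise minimum of the
    # unclipped and clipped terms so large policy steps are not rewarded.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()     # maximize => minimize the negative

new_logp = torch.tensor([-0.9, -1.2, -0.3])
old_logp = torch.tensor([-1.0, -1.0, -1.0])
advantage = torch.tensor([1.0, -0.5, 2.0])
print(ppo_actor_loss(new_logp, old_logp, advantage))
```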
The actor network's loss function is the mean of the Q values obtained from the critic network; in the implementation a minus sign is added (i.e., the negative is minimized) to stay consistent with deep learning frameworks. In mathematical form the loss is $L_{actor}=\mathbb E \left[{Q(s,a|\theta^Q)}|_{s=s_t,a=\mu(s_t|\theta^\mu)} \right]$.
Critic:
Input: state $s$ and action $a$
Output: Q value $Q(s,a)$. (In the actor-critic method, the output is $V(s)$ instead.)
Update the parameters of the current network by minimizing the mean squared error between the target network and the current network.
Add an experience replay buffer to store the interaction data between the agent and the environment.
In practice, if the update target keeps changing, the update becomes difficult. –> Use "soft" target updates rather than directly copying the weights. –> Method: add target actor and target critic networks, and when updating the target networks' parameters use a weight $\tau$, i.e., at each step blend in only a small fraction of the corresponding trained network; the purpose is to make training as likely as possible to converge.
Exploration via a random process: add random perturbations to the actions chosen by the actor so that there is some probability of exploring the full action space. Typically, the magnitude of the perturbation decays gradually as training progresses.
Soft policy improvement: update the policy towards the exponential of the new Q-function, restricting the policy to some set of policies, e.g., Gaussians. $$ \pi^{\prime}=\arg \min_{\pi' \in \Pi} D_{KL}\left(\pi'\left(\cdot \mid s_{t}\right) \,\Big\|\, \frac{\exp \left(\frac{1}{\alpha} Q_{\text{soft}}^{\pi}\left(s_{t}, \cdot\right)\right)}{Z_{\text{soft}}^{\pi}\left(s_{t}\right)}\right) $$ Here, $Z^\pi_{\text{soft}}(s_t)$ is a partition function that normalizes the distribution; it can be ignored when taking gradients. Matching $\exp(Q_{\text{soft}}^{\pi}(s_t,\cdot))$ through the KL divergence restricts the policy to a set of policies $\Pi$, which keeps the problem tractable; the policy distribution can be Gaussian.
Value: expected total reward that is obtained from the state. $$ V(s)=\mathbb E \left[ \sum_{t=0}^{\infty} r_t\gamma^t \right] $$ where $r_t$ is the local reward obtained at time $t$ of the episode.
For an agent in state $s_0$, there are $N$ actions that lead to states $s_1, s_2, \cdots, s_N$ with rewards $r_1, r_2, \cdots, r_N$. Assume we know the values $V_1, V_2, \cdots, V_N$. Then, $V_0(a=a_i)=r_i+V_i$.
In order to choose the best action, the agent needs to calculate the resulting value for every action and choose the maximum possible outcome: $V_0 = \max_{a\in \{1, 2,\cdots, N\}} (r_a+V_a)$. If we use a discount factor $\gamma$, this becomes $V_0 = \max_{a\in \{1, 2,\cdots, N\}} (r_a+\gamma V_a)$. So we look at the immediate reward plus the long-term value of the state.
For the general case, the Bellman optimality equation is $$ V_0 = \max_{a\in A} \mathbb E_{s\sim S} \left[r_{s,a}+\gamma V_s\right] = \max_{a\in A} \sum\limits_{s\in S} p_{a,0\rightarrow s} (r_{s,a}+\gamma V_s) $$ where $p_{a,i\rightarrow j}$ is the probability of reaching state $j$ from state $i$ by taking action $a$.
Meaning: the optimal value of the state is equal to the action, which gives us the maximum possible expected immediate reward, plus discounted long-term reward for the next state. This definition is recursive: the value of the state is defined via the values of immediate reachable states.
$Q$ for the state $s$ and action $a$ equals the expected immediate reward and the discounted long-term reward of the destination state.
Value of state $V_s$:
$$ V_s = \max_{a\in A} Q_{s,a} $$
The value of a state equals the value of the maximum action we can execute from this state.
Therefore, we can obtain $$ Q_{s,a} = r_{s,a} + \gamma \max_{a'\in A}Q_{s',a'} $$
2.2 Value iteration (Q-learning) algorithm
Can numerically calculate the values of states and values of actions of MDPs with known transition probabilities and rewards.
Procedure for values of states:
Initialize values of all states $V_i$ to some initial value (usually zero or random)
For every state $s$ in the MDP, perform the Bellman update $V_s \leftarrow \max_a \sum_{s'} p_{a,s\rightarrow s'} (r_{s,a}+\gamma V_{s'})$
Repeat step 2 for some large number of steps until changes become too small
Procedure for values of actions:
Initialize all $Q_{s,a}$ to zero
For every state $s$ and every action $a$ in this state, perform the update $Q_{s,a} \leftarrow \sum_{s'} p_{a,s\rightarrow s'} (r_{s,a}+\gamma \max_{a'} Q_{s',a'})$
Repeat step 2 for some large number of steps until changes become too small
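A small sketch of the Q-value iteration procedure above on a made-up two-state MDP with known transition probabilities and rewards:

```python
import numpy as np

# P[s][a] is a list of (prob, next_state, reward) triples (illustrative values).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, n_states, n_actions = 0.9, 2, 2
Q = np.zeros((n_states, n_actions))

for _ in range(200):                        # repeat until changes become small
    Q_new = np.zeros_like(Q)
    for s in P:
        for a in P[s]:
            Q_new[s, a] = sum(p * (r + gamma * Q[s2].max()) for p, s2, r in P[s][a])
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        break
    Q = Q_new

print(Q, Q.max(axis=1))                     # optimal Q values and state values
```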
Limitations:
the state space should be discrete and small enough to perform multiple iterations over all states (possible solution: discretization)
we rarely know the transition probabilities for the actions and the reward matrix. In the Bellman update, we need both a reward for every transition and the probability of this transition (solution: use experience as an estimate for both unknowns).
2.3 Practice
three tables:
Reward table: dictionary
key (state + action + target state)
value: immediate reward
Transition table: dictionary to count the experienced transitions.
Key: state + action,
value: another dictionary mapping the target state into a count of how many times we have seen it.
used to estimate the transition probabilities.
Value table: dictionary mapping a state into the calculated value.
Procedure:
play random steps from the environment –> populate the reward and transition tables.
perform a value iteration loop for all states, update value table
play several full episodes to check the improvement using the updated value table.
Difference from the previous code of value iteration:
Value table: in value iteration, we keep the value of each state, so the key in the value table is a state. In Q-learning, the Q-function has two parameters, state and action, so the key in the value table is (state, action).
In Q-learning, we do not need the calc_action_value function, since the action value is stored in the value table.
policy is usually represented by probability distribution over available actions.
value-based:
compute the value of every possible action and choose the best action.
on-policy v.s. off-policy
off-policy: can learn from old historical data (obtained by a previous version of the agent, recorded from human demonstration, or simply seen by the same agent several episodes ago)
on-policy: requires fresh data obtained from the environment
1. Cross-Entropy method
Property:
Simple
Good convergence: works well in simple environments (no complex multi-step policy required; short episodes with frequent rewards).
model-free
policy-based
on-policy
Pipeline of the loop:
pass the observation to the network –> get probability distribution over actions –> random sampling using the distribution –> get action –> obtain next observation
Core of cross-entropy:
Discard bad episodes and train on the good episodes.
Steps:
Play N episodes
calculate total reward for each episode, decide on a reward boundary (percentile, e.g., 60th)
throw away all episodes with a reward below the boundary
train on the remaining episodes
repeat
Based on this procedure, the neural network learns how to repeat actions which lead to larger rewards, and the boundary keeps rising.
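A minimal sketch of the elite-filtering step (the episode container and the 60th-percentile boundary are illustrative assumptions):

```python
import numpy as np

def filter_elite(episodes, percentile=60):
    # episodes: list of (total_reward, [(obs, action), ...]) tuples
    rewards = np.array([r for r, _ in episodes])
    boundary = np.percentile(rewards, percentile)        # reward boundary
    elite = [steps for r, steps in episodes if r >= boundary]
    train_obs = [obs for steps in elite for obs, _ in steps]
    train_act = [act for steps in elite for _, act in steps]
    return train_obs, train_act, boundary                 # train only on elite episodes

episodes = [(5.0, [((0.1,), 1)]), (1.0, [((0.2,), 0)]), (7.0, [((0.3,), 1)])]
print(filter_elite(episodes)[2])
```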
2. Cross-entropy on CartPole
use one-hidden-layer neural network with ReLU and 128 hidden neurons.
modifications: 1) larger batches, 2) add a discount factor, 3) keep good episodes for a longer time, 4) decrease the learning rate, 5) much longer training time.
4. Theoretical background of cross-entropy
basis of cross-entropy lies in the importance sampling theorem: $$ \mathbb E_{x\sim p(x)} \left[H(x)\right] = \int_x p(x)H(x) dx = \int_x q(x)\frac{p(x)}{q(x)}H(x) dx = \mathbb E_{x\sim q(x)} \left[\frac{p(x)}{q(x)}H(x)\right] $$ In RL, $H(x)$ is a reward value obtained by some policy $x$, and $p(x)$ is the distribution of all possible policies.
First, replace $H(x)$ with an indicator function: the function equals 1 when the reward of the episode is above the threshold and 0 when it is below. The policy update becomes $$ \pi_{i+1}(a\mid s)= \operatorname*{argmin}_{\pi_{i+1}} -\mathbb E_{z\sim\pi_{i}(a\mid s)} \left[\mathbb 1\left[R(z)\geq \psi_i\right] \log \pi_{i+1}(a\mid s)\right] $$ That is, sample episodes with the current policy, then minimize the negative log likelihood of the most successful samples under the new policy.