Cross-entropy method

Classification of RL methods:

  • model-free vs. model-based
    • model-free:
      • does not build a model of the environment or the reward; it directly connects observations to actions
      • usually easier to train, because building a good model of a complex environment is hard
    • model-based:
      • tries to predict the next observation and/or reward, and picks the best possible action based on that prediction
      • typically used in deterministic environments, such as board games with strict rules
  • value-based vs. policy-based
    • policy-based:
      • directly approximates the policy of the agent
      • the policy is usually represented as a probability distribution over the available actions
    • value-based:
      • computes the value of every possible action and picks the action with the best value
  • on-policy vs. off-policy
    • off-policy: can learn from old historical data (obtained by a previous version of the agent, recorded from human demonstrations, or simply seen by the same agent several episodes ago)
    • on-policy: requires fresh data obtained from the environment

1. Cross-Entropy method

  • Properties:

    • Simple
    • Good convergence: works well in simple environments (no complex, multistep policy required; short episodes with frequent rewards).
    • model-free
    • policy-based
    • on-policy
  • Pipeline of the loop:

    • pass the observation to the network -> get a probability distribution over actions -> randomly sample from the distribution -> get an action -> obtain the next observation
  • Core of cross-entropy:

    • throw away bad episodes and train on the good ones
  • Steps:

    1. Play N episodes
    2. Calculate the total reward for each episode and decide on a reward boundary (a percentile, e.g., the 60th)
    3. Throw away all episodes with a reward below the boundary
    4. Train on the remaining episodes
    5. Repeat

    Based on this procedure, the neural network learns to repeat the actions that lead to larger rewards, which in turn pushes the reward boundary higher.
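
A minimal sketch of the filtering step in this loop (the episode objects are assumed to carry a `total_reward` attribute, and `play_episodes`/`train_on` are hypothetical placeholders, not the book's code):

```python
import numpy as np

PERCENTILE = 60  # reward boundary: keep only episodes above the 60th percentile

def select_elite(episodes):
    """Keep the episodes whose total reward clears the percentile boundary."""
    rewards = [ep.total_reward for ep in episodes]
    boundary = np.percentile(rewards, PERCENTILE)
    elite = [ep for ep in episodes if ep.total_reward >= boundary]
    return elite, boundary

# The loop sketched in the steps above (hypothetical helpers):
#   episodes = play_episodes(env, net, n=16)      # 1. play N episodes
#   elite, boundary = select_elite(episodes)      # 2.-3. filter by the reward boundary
#   train_on(net, elite)                          # 4. train on the remaining episodes
#   # 5. repeat until the mean reward is good enough
```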

2. Cross-entropy on CartPole

Use a one-hidden-layer neural network with ReLU and 128 hidden neurons.

code example:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter04/01_cartpole.py
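
A minimal PyTorch sketch of the network described above (the linked chapter code is the authoritative version; the sizes `obs_size=4` and `n_actions=2` are the CartPole values):

```python
import numpy as np
import torch
import torch.nn as nn

HIDDEN_SIZE = 128

class Net(nn.Module):
    """One hidden layer with ReLU; outputs raw action logits."""
    def __init__(self, obs_size: int, hidden_size: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),  # logits; softmax is applied outside
        )

    def forward(self, x):
        return self.net(x)

# Sampling an action from the predicted distribution (CartPole: 4 observations, 2 actions):
net = Net(obs_size=4, hidden_size=HIDDEN_SIZE, n_actions=2)
obs = np.zeros(4, dtype=np.float32)                       # placeholder observation
logits = net(torch.as_tensor(obs).unsqueeze(0))
probs = torch.softmax(logits, dim=1).detach().numpy()[0]
action = np.random.choice(len(probs), p=probs)
```

Keeping the output as raw logits lets the training step feed them straight into `nn.CrossEntropyLoss`, which expects unnormalized scores.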

3. Cross-entropy on FrozenLake

A grid world with specific rules: the agent receives a reward of 1.0 only when it reaches the goal, and 0 otherwise.

code example:

A naïve application of the cross-entropy method, which may fail:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter04/02_frozenlake_naive.py

This example exposes the limitations of the cross-entropy method:

  • no intermediate indication about whether the agent has succeeded or failed.
  • total reward of episodes should have enough variability to separate the good episodes from the bad ones.
  • for training, episodes have to be finite and short.

Tweaked version of the cross-entropy code:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter04/03_frozenlake_tweaked.py

Modifications: 1) larger batches, 2) a discount factor applied to the reward, 3) keeping good episodes for a longer time, 4) a lower learning rate, 5) much longer training time.
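
A sketch of how modifications 2) and 3) might look (the `filter_batch` helper and the `GAMMA`/`PERCENTILE` values are illustrative assumptions, and the episode objects are assumed to carry `total_reward` and a list of `steps`; see the linked `03_frozenlake_tweaked.py` for the actual code):

```python
import numpy as np

GAMMA = 0.9        # discount factor: shorter successful episodes score higher
PERCENTILE = 30    # keep a larger share of episodes than on CartPole

def filter_batch(batch, percentile=PERCENTILE):
    """Filter episodes by their length-discounted reward."""
    disc_rewards = [ep.total_reward * (GAMMA ** len(ep.steps)) for ep in batch]
    boundary = np.percentile(disc_rewards, percentile)
    elite = [ep for ep, dr in zip(batch, disc_rewards) if dr > boundary]
    return elite, boundary

# Keeping good episodes for a longer time: carry the elite episodes over to the
# next batch instead of discarding them after one training step.
#   elite, boundary = filter_batch(new_batch + previous_elite)
#   previous_elite = elite[-500:]   # cap the buffer so it does not grow without bound
```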

4. Theoretical background of cross-entropy

The basis of the cross-entropy method lies in the importance sampling theorem:
$$
\mathbb E_{x\sim p(x)} \left[H(x)\right] = \int_x p(x)H(x) dx = \int_x q(x)\frac{p(x)}{q(x)}H(x) dx = \mathbb E_{x\sim q(x)} \left[\frac{p(x)}{q(x)}H(x)\right]
$$
In RL, $H(x)$ is a reward value obtained by some policy $x$, and $p(x)$ is the distribution of all possible policies.

We do not want to maximize the reward by searching through every possible policy. Instead, we want to find a way to approximate $p(x)H(x)$ by $q(x)$, iteratively reducing the distance between the two.
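
A small numerical check of the identity above (not from the source; the distributions and the function $H$ are arbitrary choices): sampling from $q$ and reweighting by $p/q$ recovers the expectation under $p$.

```python
import numpy as np
from scipy.stats import norm

H = lambda x: x ** 2                       # arbitrary "reward" function of x
p = norm(loc=0.0, scale=1.0)               # target distribution p(x); E_p[x^2] = 1
q = norm(loc=1.0, scale=1.5)               # sampling distribution q(x)

x = q.rvs(size=200_000, random_state=42)   # draw samples from q, not from p
weights = p.pdf(x) / q.pdf(x)              # importance weights p(x)/q(x)
estimate = np.mean(weights * H(x))         # E_{x~q}[(p/q) H(x)] ~= E_{x~p}[H(x)]
print(estimate)                            # close to 1.0
```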

The distance between the two can be measured with the KL divergence, defined as
$$
KL(p_1(x) \| p_2(x)) = \mathbb E_{x\sim p_1(x)} \left[\log\frac{p_1(x)}{p_2(x)}\right] = \underbrace{\mathbb E_{x\sim p_1(x)}\left[\log{p_1(x)}\right]}_{\text{entropy}} - \underbrace{\mathbb E_{x\sim p_1(x)} \left[\log{p_2(x)}\right]}_{\text{cross-entropy}}
$$
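
A quick numerical check of this decomposition for two arbitrary discrete distributions (not from the source):

```python
import numpy as np

p1 = np.array([0.6, 0.3, 0.1])   # arbitrary discrete distributions
p2 = np.array([0.4, 0.4, 0.2])

kl = np.sum(p1 * np.log(p1 / p2))
entropy_term = np.sum(p1 * np.log(p1))   # E_{x~p1}[log p1(x)]
cross_term = np.sum(p1 * np.log(p2))     # E_{x~p1}[log p2(x)]
assert np.isclose(kl, entropy_term - cross_term)
print(kl)
```
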
Combining the two formulas, we obtain an iterative algorithm that starts with $q_0(x)=p(x)$ and improves on every step. The iteration that approximates $p(x)H(x)$ is
$$
q_{i+1}(x) = \operatorname{argmin}_{q_{i+1}(x)} -\mathbb E_{x\sim q_i(x)} \left[\frac{p(x)}{q_i(x)}H(x) \log q_{i+1}(x)\right]
$$
In RL, we first replace $H(x)$ with an indicator function, i.e., the function is 1 when the reward of the episode is above the threshold and 0 when it is below. The policy update then becomes
$$
\pi_{i+1}(a|s) = \operatorname{argmin}_{\pi_{i+1}} -\mathbb E_{z\sim\pi_{i}(a|s)} \left[R(z)\geq \psi_i\right] \log \pi_{i+1}(a|s)
$$
In other words: sample episodes with our current policy, then minimize the negative log-likelihood of the most successful samples and our policy.
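
A sketch of what this update looks like in PyTorch (a minimal illustration, assuming `net` is the policy network above and that `elite_obs`/`elite_actions` have already been filtered out of the batch):

```python
import torch
import torch.nn as nn

def cross_entropy_update(net, optimizer, elite_obs, elite_actions):
    """One policy update: minimize -log pi(a|s) over the elite (obs, action) pairs."""
    objective = nn.CrossEntropyLoss()                          # log-softmax + NLL on raw logits
    obs_v = torch.as_tensor(elite_obs, dtype=torch.float32)    # observations from elite episodes
    acts_v = torch.as_tensor(elite_actions, dtype=torch.long)  # actions that were actually taken
    optimizer.zero_grad()
    loss = objective(net(obs_v), acts_v)                       # -E[ log pi_{i+1}(a|s) ] over elite samples
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (hypothetical names):
#   optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
#   loss = cross_entropy_update(net, optimizer, elite_obs, elite_actions)
```

Using `nn.CrossEntropyLoss` on the raw logits corresponds directly to the negative log-likelihood term in the formula, since it combines log-softmax with NLL.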