Cross-entropy method

Classification of RL methods:

  • model-free vs. model-based
    • model-free:
      • does not build a model of the environment or the reward; it directly connects observations to actions
      • usually easier to train, because building a good model of a complex environment is hard
    • model-based:
      • tries to predict the next observation and/or reward, and picks the best possible action based on that prediction
      • typically used in deterministic environments, such as board games with strict rules
  • value-based vs. policy-based
    • policy-based:
      • directly approximates the policy of the agent
      • the policy is usually represented as a probability distribution over the available actions
    • value-based:
      • computes the value of every possible action and picks the action with the best value
  • on-policy vs. off-policy
    • off-policy: can learn from old historical data (obtained by a previous version of the agent, recorded from human demonstrations, or simply seen by the same agent several episodes ago)
    • on-policy: requires fresh data obtained from the environment

1. Cross-Entropy method

  • Properties:

    • Simple
    • Good convergence: works well in simple environments (no complex, multistep policy required; short episodes with frequent rewards).
    • model-free
    • policy-based
    • on-policy
  • Pipeline of the loop:

    • pass the observation to the network -> get a probability distribution over actions -> randomly sample from the distribution -> get an action -> obtain the next observation
  • Core of cross-entropy:

    • throw away bad episodes and train on the good ones
  • Steps:

    1. Play N episodes
    2. Calculate the total reward for each episode and decide on a reward boundary (a percentile, e.g., the 60th)
    3. Throw away all episodes with a reward below the boundary
    4. Train on the remaining episodes
    5. Repeat

    Based on this procedure, the neural network learns to repeat the actions that lead to larger rewards, which in turn pushes the reward boundary higher.
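
A minimal sketch of the filtering step in this loop (the episode objects are assumed to carry a `total_reward` attribute, and `play_episodes`/`train_on` are hypothetical placeholders, not the book's code):

```python
import numpy as np

PERCENTILE = 60  # reward boundary: keep only episodes above the 60th percentile

def select_elite(episodes):
    """Keep the episodes whose total reward clears the percentile boundary."""
    rewards = [ep.total_reward for ep in episodes]
    boundary = np.percentile(rewards, PERCENTILE)
    elite = [ep for ep in episodes if ep.total_reward >= boundary]
    return elite, boundary

# The loop sketched in the steps above (hypothetical helpers):
#   episodes = play_episodes(env, net, n=16)      # 1. play N episodes
#   elite, boundary = select_elite(episodes)      # 2.-3. filter by the reward boundary
#   train_on(net, elite)                          # 4. train on the remaining episodes
#   # 5. repeat until the mean reward is good enough
```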

2. Cross-entropy on CartPole

Use a one-hidden-layer neural network with ReLU and 128 hidden neurons.

code example:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter04/01_cartpole.py
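
A minimal PyTorch sketch of the network described above (the linked chapter code is the authoritative version; the sizes `obs_size=4` and `n_actions=2` are the CartPole values):

```python
import numpy as np
import torch
import torch.nn as nn

HIDDEN_SIZE = 128

class Net(nn.Module):
    """One hidden layer with ReLU; outputs raw action logits."""
    def __init__(self, obs_size: int, hidden_size: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),  # logits; softmax is applied outside
        )

    def forward(self, x):
        return self.net(x)

# Sampling an action from the predicted distribution (CartPole: 4 observations, 2 actions):
net = Net(obs_size=4, hidden_size=HIDDEN_SIZE, n_actions=2)
obs = np.zeros(4, dtype=np.float32)                       # placeholder observation
logits = net(torch.as_tensor(obs).unsqueeze(0))
probs = torch.softmax(logits, dim=1).detach().numpy()[0]
action = np.random.choice(len(probs), p=probs)
```

Keeping the output as raw logits lets the training step feed them straight into `nn.CrossEntropyLoss`, which expects unnormalized scores.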

3. Cross-entropy on FrozenLake

A grid world with specific rules: the agent receives a reward of 1.0 only when it reaches the goal, and 0 otherwise.

code example:

A naïve application of the cross-entropy method, which may fail:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter04/02_frozenlake_naive.py

This example exposes the limitations of the cross-entropy method:

  • no intermediate indication about whether the agent has succeeded or failed.
  • total reward of episodes should have enough variability to separate the good episodes from the bad ones.
  • for training, episodes have to be finite and short.

Tweaked version of the cross-entropy code:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter04/03_frozenlake_tweaked.py

Modifications: 1) larger batches, 2) a discount factor applied to the reward, 3) keeping good episodes for a longer time, 4) a lower learning rate, 5) much longer training time.
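
A sketch of how modifications 2) and 3) might look (the `filter_batch` helper and the `GAMMA`/`PERCENTILE` values are illustrative assumptions, and the episode objects are assumed to carry `total_reward` and a list of `steps`; see the linked `03_frozenlake_tweaked.py` for the actual code):

```python
import numpy as np

GAMMA = 0.9        # discount factor: shorter successful episodes score higher
PERCENTILE = 30    # keep a larger share of episodes than on CartPole

def filter_batch(batch, percentile=PERCENTILE):
    """Filter episodes by their length-discounted reward."""
    disc_rewards = [ep.total_reward * (GAMMA ** len(ep.steps)) for ep in batch]
    boundary = np.percentile(disc_rewards, percentile)
    elite = [ep for ep, dr in zip(batch, disc_rewards) if dr > boundary]
    return elite, boundary

# Keeping good episodes for a longer time: carry the elite episodes over to the
# next batch instead of discarding them after one training step.
#   elite, boundary = filter_batch(new_batch + previous_elite)
#   previous_elite = elite[-500:]   # cap the buffer so it does not grow without bound
```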

4. Theoretical background of cross-entropy

The basis of the cross-entropy method lies in the importance sampling theorem:
$$
\mathbb E_{x\sim p(x)} \left[H(x)\right] = \int_x p(x)H(x) dx = \int_x q(x)\frac{p(x)}{q(x)}H(x) dx = \mathbb E_{x\sim q(x)} \left[\frac{p(x)}{q(x)}H(x)\right]
$$
In RL, $H(x)$ is a reward value obtained by some policy $x$, and $p(x)$ is the distribution of all possible policies.

We do not want to maximize the reward by searching through every possible policy. Instead, we want to find a way to approximate $p(x)H(x)$ by $q(x)$, iteratively reducing the distance between the two.
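
A small numerical check of the identity above (not from the source; the distributions and the function $H$ are arbitrary choices): sampling from $q$ and reweighting by $p/q$ recovers the expectation under $p$.

```python
import numpy as np
from scipy.stats import norm

H = lambda x: x ** 2                       # arbitrary "reward" function of x
p = norm(loc=0.0, scale=1.0)               # target distribution p(x); E_p[x^2] = 1
q = norm(loc=1.0, scale=1.5)               # sampling distribution q(x)

x = q.rvs(size=200_000, random_state=42)   # draw samples from q, not from p
weights = p.pdf(x) / q.pdf(x)              # importance weights p(x)/q(x)
estimate = np.mean(weights * H(x))         # E_{x~q}[(p/q) H(x)] ~= E_{x~p}[H(x)]
print(estimate)                            # close to 1.0
```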

The distance between the two can be measured with the KL divergence, defined as
$$
KL(p_1(x) \| p_2(x)) = \mathbb E_{x\sim p_1(x)} \left[\log\frac{p_1(x)}{p_2(x)}\right] = \underbrace{\mathbb E_{x\sim p_1(x)}\left[\log{p_1(x)}\right]}_{\text{entropy}} - \underbrace{\mathbb E_{x\sim p_1(x)} \left[\log{p_2(x)}\right]}_{\text{cross-entropy}}
$$
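
A quick numerical check of this decomposition for two arbitrary discrete distributions (not from the source):

```python
import numpy as np

p1 = np.array([0.6, 0.3, 0.1])   # arbitrary discrete distributions
p2 = np.array([0.4, 0.4, 0.2])

kl = np.sum(p1 * np.log(p1 / p2))
entropy_term = np.sum(p1 * np.log(p1))   # E_{x~p1}[log p1(x)]
cross_term = np.sum(p1 * np.log(p2))     # E_{x~p1}[log p2(x)]
assert np.isclose(kl, entropy_term - cross_term)
print(kl)
```
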
Combining the two formulas, we obtain an iterative algorithm that starts with $q_0(x)=p(x)$ and improves on every step. The iteration that approximates $p(x)H(x)$ is
$$
q_{i+1}(x) = \operatorname{argmin}_{q_{i+1}(x)} -\mathbb E_{x\sim q_i(x)} \left[\frac{p(x)}{q_i(x)}H(x) \log q_{i+1}(x)\right]
$$
In RL, we first replace $H(x)$ with an indicator function, i.e., the function is 1 when the reward of the episode is above the threshold and 0 when it is below. The policy update then becomes
$$
\pi_{i+1}(a|s) = \operatorname{argmin}_{\pi_{i+1}} -\mathbb E_{z\sim\pi_{i}(a|s)} \left[R(z)\geq \psi_i\right] \log \pi_{i+1}(a|s)
$$
In other words: sample episodes with our current policy, then minimize the negative log-likelihood of the most successful samples and our policy.
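
A sketch of what this update looks like in PyTorch (a minimal illustration, assuming `net` is the policy network above and that `elite_obs`/`elite_actions` have already been filtered out of the batch):

```python
import torch
import torch.nn as nn

def cross_entropy_update(net, optimizer, elite_obs, elite_actions):
    """One policy update: minimize -log pi(a|s) over the elite (obs, action) pairs."""
    objective = nn.CrossEntropyLoss()                          # log-softmax + NLL on raw logits
    obs_v = torch.as_tensor(elite_obs, dtype=torch.float32)    # observations from elite episodes
    acts_v = torch.as_tensor(elite_actions, dtype=torch.long)  # actions that were actually taken
    optimizer.zero_grad()
    loss = objective(net(obs_v), acts_v)                       # -E[ log pi_{i+1}(a|s) ] over elite samples
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (hypothetical names):
#   optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
#   loss = cross_entropy_update(net, optimizer, elite_obs, elite_actions)
```

Using `nn.CrossEntropyLoss` on the raw logits corresponds directly to the negative log-likelihood term in the formula, since it combines log-softmax with NLL.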