Cross-entropy method
Taxonomy of RL methods:
- model-free vs. model-based
  - model-free:
    - Does not build a model of the environment or the reward; it connects observations directly to actions.
    - Usually easier to train, because building a good model of a complex environment with rich observations is hard.
  - model-based:
    - Tries to predict the next observation and/or reward, and uses the prediction to choose the best possible action.
    - Suited to deterministic environments, such as board games with strict rules.
- value-based vs. policy-based
  - policy-based:
    - Directly approximates the policy of the agent.
    - The policy is usually represented by a probability distribution over the available actions.
  - value-based:
    - Computes the value of every possible action and chooses the action with the best value.
- on-policy vs. off-policy
  - off-policy: can learn from old historical data (obtained by a previous version of the agent, recorded from human demonstrations, or simply seen by the same agent several episodes ago).
  - on-policy: requires fresh data obtained from the environment with the policy currently being trained.
1. Cross-entropy method
Properties:
- Simple
- Good convergence: works well in simple environments that do not require complex, multi-step policies and that have short episodes with frequent rewards.
- model-free
- policy-based
- on-policy
Pipeline of the loop:
- pass the observation to the network -> get a probability distribution over actions -> sample an action from the distribution -> pass the action to the environment -> obtain the next observation and reward
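A minimal sketch of one such episode loop, assuming the Gymnasium API and a PyTorch network `net` that maps an observation to raw action logits (`env` and `net` are placeholders defined elsewhere):

```python
import numpy as np
import torch
import torch.nn.functional as F

def play_episode(env, net):
    """Play one episode with the current policy and record (observation, action) pairs."""
    episode_steps = []
    total_reward = 0.0
    obs, _ = env.reset()
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        probs = F.softmax(net(obs_t), dim=1)[0]          # probability distribution over actions
        action = int(np.random.choice(len(probs), p=probs.detach().numpy()))
        episode_steps.append((obs, action))              # record the pair used later for training
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward, episode_steps
```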
Core of cross-entropy:
- Throw away the bad episodes and train on the good ones.
Steps:
- Play N episodes
- Calculate the total reward for each episode and decide on a reward boundary (a percentile, e.g., the 60th)
- Throw away all episodes with a reward below the boundary
- Train on the remaining episodes
- Repeat
Based on this procedure, the neural network learns how to repeat the actions that lead to larger rewards, and the reward boundary rises as training progresses.
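A sketch of the filtering step, assuming each episode is stored as a `(total_reward, steps)` tuple as in the loop sketch above (the 70th percentile is only an example value):

```python
import numpy as np

def filter_batch(batch, percentile=70):
    """Keep only the (observation, action) pairs from episodes above the reward boundary."""
    rewards = [total_reward for total_reward, _ in batch]
    reward_bound = np.percentile(rewards, percentile)

    train_obs, train_actions = [], []
    for total_reward, steps in batch:
        if total_reward < reward_bound:
            continue                                     # throw away episodes below the boundary
        train_obs.extend(obs for obs, _ in steps)
        train_actions.extend(action for _, action in steps)
    return train_obs, train_actions, reward_bound
```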
2. Cross-entropy on CartPole
Use a one-hidden-layer neural network with ReLU and 128 hidden neurons.
code example:
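A sketch of such a network in PyTorch; the observation and action sizes come from the environment (4 and 2 for CartPole-v1):

```python
import torch.nn as nn

class Net(nn.Module):
    """One-hidden-layer policy network: observation -> raw action logits."""
    def __init__(self, obs_size, hidden_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),   # logits; softmax is applied when sampling
        )

    def forward(self, x):
        return self.net(x)

net = Net(obs_size=4, hidden_size=128, n_actions=2)   # CartPole-v1: 4 observations, 2 actions
```

Returning raw logits and applying softmax only when sampling keeps training numerically stable, since `nn.CrossEntropyLoss` expects logits.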
3. Cross-entropy on FrozenLake
A grid world with specific rules: the agent moves on slippery ice, can fall into holes, and receives a reward of 1 only when it reaches the goal.
code example: naïve cross-entropy applied directly to FrozenLake, which may fail.
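One detail needed before reusing the CartPole code: FrozenLake reports its observation as a single integer (the agent's cell index), so the observation is typically one-hot encoded first. A sketch of such a wrapper, assuming the Gymnasium API:

```python
import gymnasium as gym
import numpy as np

class DiscreteOneHotWrapper(gym.ObservationWrapper):
    """Turn a Discrete observation (a cell index) into a one-hot float vector."""
    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(
            0.0, 1.0, (env.observation_space.n,), dtype=np.float32)

    def observation(self, observation):
        res = np.zeros(self.observation_space.shape, dtype=np.float32)
        res[observation] = 1.0
        return res

env = DiscreteOneHotWrapper(gym.make("FrozenLake-v1"))
```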
This exposes the limitations of the cross-entropy method:
- no intermediate indication about whether the agent has succeeded or failed.
- total reward of episodes should have enough variability to separate the good episodes from the bad ones.
- for training, episodes have to be finite and short.
Tweaked cross-entropy code:
Modifications: 1) larger batches, 2) apply a discount factor to the total reward, 3) keep good (elite) episodes for a longer time, 4) decrease the learning rate, 5) much longer training time.
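A sketch of how modifications 2) and 3) can look inside the filtering function (the batch layout follows the earlier sketches; the discount factor of 0.9 and the 30th percentile are assumed example values):

```python
import numpy as np

GAMMA = 0.9  # discount factor (example value)

def filter_batch(batch, percentile=30):
    """Rank episodes by discounted reward and also return the kept elite episodes
    so they can be re-injected into the next training batch."""
    # Discounting by episode length makes shorter successful episodes score higher.
    disc_rewards = [total_reward * (GAMMA ** len(steps)) for total_reward, steps in batch]
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs, train_actions, elite_batch = [], [], []
    for (total_reward, steps), disc_reward in zip(batch, disc_rewards):
        if disc_reward > reward_bound:
            train_obs.extend(obs for obs, _ in steps)
            train_actions.extend(action for _, action in steps)
            elite_batch.append((total_reward, steps))   # keep good episodes for later batches
    return elite_batch, train_obs, train_actions, reward_bound
```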
4. Theoretical background of cross-entropy
The basis of the cross-entropy method lies in the importance sampling theorem:
$$
\mathbb E_{x\sim p(x)} \left[H(x)\right] = \int_x p(x)H(x) dx = \int_x q(x)\frac{p(x)}{q(x)}H(x) dx = \mathbb E_{x\sim q(x)} \left[\frac{p(x)}{q(x)}H(x)\right]
$$
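A quick numerical check of this identity, using two hypothetical distributions chosen purely for illustration ($p = \mathcal N(0,1)$, $q = \mathcal N(1,2^2)$, $H(x)=x^2$):

```python
import numpy as np

rng = np.random.default_rng(0)

def H(x):                                  # arbitrary function of x (a "reward")
    return x ** 2

def p_pdf(x):                              # density of N(0, 1)
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):                              # density of N(1, 2^2)
    return np.exp(-(x - 1) ** 2 / 8) / np.sqrt(8 * np.pi)

# Direct estimate of E_{x~p}[H(x)]
direct = H(rng.normal(0.0, 1.0, size=1_000_000)).mean()

# Importance-sampled estimate: E_{x~q}[p(x)/q(x) * H(x)]
x_q = rng.normal(1.0, 2.0, size=1_000_000)
weighted = (p_pdf(x_q) / q_pdf(x_q) * H(x_q)).mean()

print(direct, weighted)                    # both are close to 1.0 = E_{x~N(0,1)}[x^2]
```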
In RL, $H(x)$ is a reward value obtained by some policy $x$, and $p(x)$ is the distribution of all possible policies.
We do not want to maximize our reward by searching over all possible policies. Instead, we want to find a way to approximate $p(x)H(x)$ by $q(x)$ and iteratively minimize the distance between the two.
The distance between them is measured by the KL divergence:
$$
KL(p_1(x) \,\|\, p_2(x)) = \mathbb E_{x\sim p_1(x)} \left[\log\frac{p_1(x)}{p_2(x)}\right] = \underbrace{\mathbb E_{x\sim p_1(x)}\left[\log{p_1(x)}\right]}_{\text{entropy}} - \underbrace{\mathbb E_{x\sim p_1(x)} \left[\log{p_2(x)}\right]}_{\text{cross-entropy}}
$$
Combining the two formulas, we obtain an iterative algorithm that starts with $q_0(x)=p(x)$ and improves at every step. The iterative update for approximating $p(x)H(x)$ is
$$
q_{i+1}(x) = \operatorname{argmin}_{q_{i+1}(x)} -\mathbb E_{x\sim q_i(x)} \left[\frac{p(x)}{q_i(x)}H(x) \log q_{i+1}(x)\right]
$$
In RL, we first replace $H(x)$ with an indicator function: the function is 1 when the reward of the episode is above the threshold and 0 when the reward is below it. The policy update then becomes
$$
\pi_{i+1}(a|s) = \operatorname{argmin}_{\pi_{i+1}} -\mathbb E_{z\sim\pi_{i}(a|s)} \left[R(z)\geq \psi_i\right] \log \pi_{i+1}(a|s)
$$
This means: sample episodes with our current policy, then minimize the negative log-likelihood of the actions taken in the most successful episodes under the updated policy.
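In code, this last step is just the standard cross-entropy loss between the network's action logits and the actions taken in the elite episodes. A minimal sketch of one update, with `net`, `train_obs`, and `train_actions` taken from the earlier sketches:

```python
import numpy as np
import torch
import torch.nn as nn

def train_step(net, optimizer, train_obs, train_actions):
    """One cross-entropy update on the elite (observation, action) pairs."""
    objective = nn.CrossEntropyLoss()       # cross-entropy = negative log-likelihood over logits
    obs_t = torch.as_tensor(np.asarray(train_obs), dtype=torch.float32)
    acts_t = torch.as_tensor(train_actions, dtype=torch.long)

    optimizer.zero_grad()
    logits = net(obs_t)                     # pi_{i+1}(a|s) as raw logits
    loss = objective(logits, acts_t)        # minimize -E[log pi_{i+1}(a|s)] on elite samples
    loss.backward()
    optimizer.step()
    return loss.item()
```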