2. Bellman Equation of Optimality

Tabular Learning and the Bellman Equation

Value: the expected total discounted reward obtained starting from the state.
$$
V(s)=\mathbb E \left[ \sum_{t=0}^{\infty} r_t\gamma^t \right]
$$
where $r_t$ is the local reward obtained at step $t$ of the episode and $\gamma$ is the discount factor.
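
For example, if every step yields a reward of $1$ and $\gamma = 0.9$, the value is the geometric series
$$
V(s) = \sum_{t=0}^{\infty} 1\cdot 0.9^{t} = \frac{1}{1-0.9} = 10.
$$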

Actions can then be selected based on these values.

However, in complex environments, exploration is very important: a greedy strategy is not necessarily good in the long run. Only by exploring can the agent find the optimum and avoid traps.

1. Bellman equation of optimality

Deterministic case:

For an agent in state $s_0$, there are $N$ actions that lead to states $s_1, s_2, \cdots, s_N$ with rewards $r_1, r_2, \cdots, r_N$. Assume we know the values $V_1, V_2, \cdots, V_N$ of those states. Then the value of taking action $a_i$ is $V_0(a=a_i)=r_i+V_i$.

To choose the best action, the agent needs to calculate the resulting value for every action and pick the maximum possible outcome: $V_0 = \max_{a\in \{1, 2,\cdots, N\}} (r_a+V_a)$. With a discount factor $\gamma$, this becomes $V_0 = \max_{a\in \{1, 2,\cdots, N\}} (r_a+\gamma V_a)$. In other words, we look at the immediate reward plus the long-term value of the next state.
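
For example, with $\gamma=0.9$ and two actions leading to states with $(r_1, V_1)=(1, 5)$ and $(r_2, V_2)=(2, 3)$ (numbers made up for illustration):
$$
V_0 = \max\left(1 + 0.9\cdot 5,\; 2 + 0.9\cdot 3\right) = \max(5.5,\ 4.7) = 5.5
$$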

Stochastic case (for a single action $a$ whose outcome is random):
$$
V_0 = \mathbb E_{s\sim S} \left[r_{s,a}+\gamma V_s\right] = \sum\limits_{s\in S} p_{a,0\rightarrow s} (r_{s,a}+\gamma V_s)
$$

For the general case, the Bellman optimality equation is
$$
V_0 = \max_{a\in A} \mathbb E_{s\sim S} \left[r_{s,a}+\gamma V_s\right] = \max_{a\in A} \sum\limits_{s\in S} p_{a,0\rightarrow s} (r_{s,a}+\gamma V_s)
$$
where $p_{a,i\rightarrow j}$ is the probability of ending up in state $j$ when taking action $a$ in state $i$.

Meaning: the optimal value of a state equals the value of the best action, i.e. the maximum possible expected immediate reward plus the discounted long-term value of the next state. The definition is recursive: the value of a state is defined via the values of the immediately reachable states.
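
As a concrete illustration, here is a minimal sketch of one such backup for a single state in Python; the transition probabilities `p`, rewards `r`, and value estimates `V` below are made-up numbers, and `bellman_backup` is a hypothetical helper, not code from the linked repository:

```python
GAMMA = 0.9

def bellman_backup(p, r, V):
    # For each action a: expected immediate reward plus discounted value of
    # the next state, weighted by the transition probabilities; take the best.
    return max(
        sum(p[a][s] * (r[a][s] + GAMMA * V[s]) for s in range(len(V)))
        for a in range(len(p))
    )

# Two actions, three possible next states (made-up dynamics)
p = [[0.8, 0.2, 0.0], [0.0, 0.5, 0.5]]   # p[a][s] = P(next state = s | action a)
r = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]   # r[a][s] = immediate reward
V = [0.0, 3.0, 1.0]                      # current value estimates
print(bellman_backup(p, r, V))           # value of the best action (2.8 here)
```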

2. Value Iteration

2.1 Value of action

  • Values of $Q$ for every pair of state and action:

$$
Q_{s,a} = \mathbb E_{s'\sim S} \left[r_{s,a} + \gamma V_{s'}\right] = \sum\limits_{s'\in S} p_{a,s\rightarrow s'}(r_{s,a}+\gamma V_{s'})
$$

$Q$ for the state $s$ and action $a$ equals the expected immediate reward plus the discounted long-term value of the destination state.

  • Value of state $V_s$:

$$
V_s = \max_{a\in A} Q_{s,a}
$$

The value of a state equals the value of the best action we can execute from this state.

Therefore, substituting $V_{s'} = \max_{a'\in A} Q_{s',a'}$ into the definition of $Q_{s,a}$, we obtain
$$
Q_{s,a} = \mathbb E_{s'\sim S}\left[r_{s,a} + \gamma \max_{a'\in A}Q_{s',a'}\right]
$$

2.2 Value iteration (Q-learning) algorithm

Value iteration numerically calculates the values of states and values of actions for MDPs with known transition probabilities and rewards.

Procedure for values of states:

  1. Initialize values of all states $V_i$ to some initial value (usually zero or random)
  2. For every state $s$ in the MDP, perform the Bellman update $V_s \leftarrow \max_a \sum_{s'} p_{a,s\rightarrow s'} (r_{s,a}+\gamma V_{s'})$
  3. Repeat step 2 for some large number of iterations, or until the changes become small enough (see the sketch after this list)
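
A minimal sketch of this loop for a toy MDP with known dynamics, using NumPy; the transition tensor `P` and reward tensor `R` below are randomly generated placeholders, not the FrozenLake dynamics:

```python
import numpy as np

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2

# P[a, s, s'] = transition probability, R[a, s, s'] = immediate reward
P = np.random.dirichlet(np.ones(N_STATES), size=(N_ACTIONS, N_STATES))
R = np.random.rand(N_ACTIONS, N_STATES, N_STATES)

V = np.zeros(N_STATES)
for _ in range(1000):
    # Bellman update for every state: V(s) <- max_a sum_s' p * (r + gamma * V(s'))
    Q = (P * (R + GAMMA * V)).sum(axis=2)   # shape (N_ACTIONS, N_STATES)
    new_V = Q.max(axis=0)
    delta = np.abs(new_V - V).max()
    V = new_V
    if delta < 1e-6:                        # stop when changes become small
        break
print(V)                                    # converged state values
```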

Procedure for values of actions:

  1. Initialize all $Q_{s,a}$ to zero
  2. For every state $s$ and every action $a$ in this state, perform the update $Q_{s,a} \leftarrow \sum_{s'} p_{a,s\rightarrow s'} (r_{s,a}+\gamma \max_{a'} Q_{s',a'})$
  3. Repeat step 2 for some large number of iterations, or until the changes become small enough (a sketch follows the list)
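
The Q-form differs only in what is stored. A self-contained sketch as a function that accepts the same kind of `P` and `R` tensors as the toy example above (the function name `q_iteration` is made up):

```python
import numpy as np

def q_iteration(P, R, gamma=0.9, iters=1000, tol=1e-6):
    """Iterate Q(s, a) <- sum_s' p * (r + gamma * max_a' Q(s', a'))
    for an MDP with known P[a, s, s'] and R[a, s, s']."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # Q.max(axis=1) is the value of each next state under the current Q
        new_Q = (P * (R + gamma * Q.max(axis=1))).sum(axis=2).T
        delta = np.abs(new_Q - Q).max()
        Q = new_Q
        if delta < tol:
            break
    return Q

# Usage with the toy tensors from the previous sketch:
# Q = q_iteration(P, R); print(Q.max(axis=1))  # matches the V computed above
```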

Limitations:

  • the state space must be discrete and small enough to perform multiple iterations over all states (possible solution: discretization of continuous spaces)
  • we rarely know the transition probabilities for the actions and the reward matrix. The Bellman update needs both the reward for every transition and the probability of that transition (solution: use the agent's experience as an estimate of both unknowns)

2.3 Practice

We keep three tables (a minimal sketch in code follows the list):

  1. Reward table: dictionary
    • key: composite "source state" + "action" + "target state"
    • value: immediate reward
  2. Transition table: dictionary used to count the experienced transitions
    • key: "state" + "action"
    • value: another dictionary mapping the target state to the number of times we have seen it
    • used to estimate the probabilities of transitions
  3. Value table: dictionary mapping a state to its calculated value
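
A minimal sketch of these data structures in Python (hypothetical names, in the spirit of the linked code):

```python
import collections

# Reward table: (state, action, target_state) -> immediate reward
rewards = collections.defaultdict(float)
# Transition table: (state, action) -> Counter of target states (frequencies)
transits = collections.defaultdict(collections.Counter)
# Value table: state -> current value estimate
values = collections.defaultdict(float)

def record_transition(state, action, reward, new_state):
    # Store the reward and count the observed target state
    rewards[(state, action, new_state)] = reward
    transits[(state, action)][new_state] += 1
```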

Procedure:

  1. play random steps in the environment -> populate the reward and transition tables
  2. perform a value iteration loop over all states and update the value table
  3. play several full episodes to check the improvement, using the updated value table

Code example of value iteration:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter05/01_frozenlake_v_iteration.py

Main functions in the code (a rough sketch of these helpers follows the list):

  1. play_n_random_steps(count):

    • gather random experience from the environment and update the reward and transition tables
  2. calc_action_value(state, action):

    • calculate the value of the action from the state, using the transition, reward and value tables
    • with it, we can select the best action and calculate the new value of the state
  3. select_action(state):

    • select the action with the largest value returned by calc_action_value
  4. play_episode(env):

    • play one full episode using select_action, and return the total reward (used to test the current value table)
  5. value_iteration():

    • for every state, update its value to the best action value, as in step 2 of the value iteration procedure
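
A hypothetical re-creation of these helpers, in the spirit of the linked script but not copied from it; it assumes the classic Gym API (env.reset() returns an observation, env.step() returns (obs, reward, done, info)) and repeats the three tables so the block is self-contained:

```python
import collections
import gym

ENV_NAME = "FrozenLake-v0"   # newer Gym versions register "FrozenLake-v1"
GAMMA = 0.9

env = gym.make(ENV_NAME)
state = env.reset()

rewards = collections.defaultdict(float)                  # (s, a, s') -> r
transits = collections.defaultdict(collections.Counter)   # (s, a) -> Counter(s')
values = collections.defaultdict(float)                   # s -> V(s)

def play_n_random_steps(count):
    """Gather random experience; populate the reward and transition tables."""
    global state
    for _ in range(count):
        action = env.action_space.sample()
        new_state, reward, done, _ = env.step(action)
        rewards[(state, action, new_state)] = reward
        transits[(state, action)][new_state] += 1
        state = env.reset() if done else new_state

def calc_action_value(s, a):
    """Estimate Q(s, a) from the transition counts, rewards and current values."""
    target_counts = transits[(s, a)]
    total = sum(target_counts.values())
    return sum((cnt / total) * (rewards[(s, a, tgt)] + GAMMA * values[tgt])
               for tgt, cnt in target_counts.items())

def select_action(s):
    """Greedy action with respect to the current action-value estimates."""
    return max(range(env.action_space.n), key=lambda a: calc_action_value(s, a))

def play_episode(test_env):
    """Play one full episode greedily; return the total reward."""
    total_reward, s = 0.0, test_env.reset()
    while True:
        s, r, done, _ = test_env.step(select_action(s))
        total_reward += r
        if done:
            return total_reward

def value_iteration():
    """Bellman update of V(s) for every state, using the estimated dynamics."""
    for s in range(env.observation_space.n):
        values[s] = max(calc_action_value(s, a)
                        for a in range(env.action_space.n))
```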

Code example of Q learning:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter05/02_frozenlake_q_iteration.py

Differences from the previous value iteration code:

  1. Value table. In value iteration, we keep the value of each state, so the key in the value table is a state. In Q-learning, the Q-function has two parameters, state and action, so the key in the value table is the pair (state, action)
  2. In Q-learning, we do not need the calc_action_value function, since the action values are stored directly in the value table
  3. value_iteration. In value iteration, the update takes a max over the action values computed from the state values; in Q-learning, the update fills $Q_{s,a}$ for every (state, action) pair directly, taking the max over the stored Q-values of the next state (a sketch follows the list)
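
A sketch of what this Q-style update loop might look like, reusing env, GAMMA, rewards and transits from the earlier sketch; the value table is now keyed by (state, action), and the function name value_iteration_q is made up:

```python
q_values = collections.defaultdict(float)   # (state, action) -> Q estimate

def value_iteration_q():
    """Fill Q(s, a) for every (state, action) pair directly from the counted
    transitions; no separate calc_action_value step is needed."""
    for s in range(env.observation_space.n):
        for a in range(env.action_space.n):
            q, target_counts = 0.0, transits[(s, a)]
            total = sum(target_counts.values())
            for tgt, cnt in target_counts.items():
                best_next = max(q_values[(tgt, a2)]
                                for a2 in range(env.action_space.n))
                q += (cnt / total) * (rewards[(s, a, tgt)] + GAMMA * best_next)
            q_values[(s, a)] = q
```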