2. Bellman Equation of Optimality
Tabular Learning and the Bellman Equation
Value: the expected total reward that can be obtained starting from the state.
$$
V(s)=\mathbb E \left[ \sum_{t=0}^{\infty} r_t\gamma^t \right]
$$
where $r_t$ is the local reward obtained at time $t$ of the episode.
Then the agent can choose actions based on these values.
However, in complex environments exploration is very important: a greedy strategy is not necessarily good in the long run, and only by exploring can the agent find the optimum and avoid traps.
1. Bellman equation of optimality
Deterministic cases:
For an agent in state $s_0$, there are $N$ actions leading to states $s_1, s_2, \cdots, s_N$ with rewards $r_1, r_2, \cdots, r_N$. Assume we know the values $V_1, V_2, \cdots, V_N$. Then $V_0(a=a_i)=r_i+V_i$.
In order to choose the best action, the agent needs to calculate the resulting value for every action and pick the maximum possible outcome: $V_0 = \max_{a\in \{1, 2,\cdots, N\}} (r_a+V_a)$. With a discount factor $\gamma$, this becomes $V_0 = \max_{a\in \{1, 2,\cdots, N\}} (r_a+\gamma V_a)$. So we look at the immediate reward plus the long-term value of the state.
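As a quick illustration of the deterministic backup, here is a minimal sketch with made-up rewards and next-state values (all numbers are hypothetical):
```python
GAMMA = 0.9

rewards = [1.0, 2.0, 0.0]      # r_i obtained by actions a_1 .. a_N
next_values = [5.0, 1.0, 7.0]  # known values V_i of the resulting states

# V_0 = max_i (r_i + gamma * V_i)
v_0 = max(r + GAMMA * v for r, v in zip(rewards, next_values))
print(v_0)  # 6.3: the third action wins because of its large long-term value
```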
Stochastic cases:
$$
V_0 = \mathbb E_{s\sim S} \left[r_{s,a}+\gamma V_s\right] = \sum\limits_{s\in S} p_{a,0\rightarrow s} (r_{s,a}+\gamma V_s)
$$
For the general case, the Bellman optimality equation is
$$
V_0 = \max_{a\in A} \mathbb E_{s\sim S} \left[r_{s,a}+\gamma V_s\right] = \max_{a\in A} \sum\limits_{s\in S} p_{a,0\rightarrow s} (r_{s,a}+\gamma V_s)
$$
where $p_{a,i\rightarrow j}$ is the probability of ending up in state $j$ when taking action $a$ in state $i$.
Meaning: the optimal value of a state equals the value of the best action, that is, the maximum over actions of the expected immediate reward plus the discounted long-term value of the next state. The definition is recursive: the value of a state is defined via the values of the immediately reachable states.
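A minimal sketch of this backup for a single state, assuming we are given the transition probabilities, rewards, and current value estimates (the states, numbers, and container layout below are illustrative):
```python
GAMMA = 0.9

# transitions[a] lists the outcomes of action a as (probability, next_state, reward)
transitions = {
    "left":  [(0.8, "s1", 1.0), (0.2, "s2", 0.0)],
    "right": [(1.0, "s2", 2.0)],
}
values = {"s1": 5.0, "s2": 1.0}

# V_0 = max_a sum_s p_{a,0->s} * (r_{s,a} + gamma * V_s)
v_0 = max(
    sum(p * (r + GAMMA * values[s]) for p, s, r in outcomes)
    for outcomes in transitions.values()
)
print(v_0)  # left: 0.8*(1 + 0.9*5) + 0.2*(0 + 0.9*1) = 4.58; right: 2.9 -> 4.58
```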
2. Value Iteration
2.1 Value of action
- Values of $Q$ for every pair of state and action:
$$
Q_{s,a} = \mathbb E_{s'\sim S} \left[r_{s,a} + \gamma V_{s'}\right] = \sum_{s'\in S} {p_{a,s\rightarrow s'}(r_{s,a}+\gamma V_{s'})}
$$
$Q$ for the state $s$ and action $a$ equals the expected immediate reward plus the discounted long-term value of the destination state.
- Value of state $V_s$:
$$
V_s = \max_{a\in A} Q_{s,a}
$$
The value of a state equals the value of the best action we can execute from this state.
Therefore, we can obtain
$$
Q_{s,a} = r_{s,a} + \gamma \max_{a'\in A}Q_{s',a'}
$$
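The two relations can be written down directly; the sketch below reuses the illustrative (probability, next state, reward) layout from the previous example:
```python
GAMMA = 0.9

# transitions[(s, a)] lists the outcomes of action a in state s as (p, s', r)
transitions = {
    ("s0", "left"):  [(0.8, "s1", 1.0), (0.2, "s2", 0.0)],
    ("s0", "right"): [(1.0, "s2", 2.0)],
}
values = {"s1": 5.0, "s2": 1.0}

def q_value(state, action):
    # Q_{s,a} = sum_{s'} p_{a,s->s'} * (r_{s,a} + gamma * V_{s'})
    return sum(p * (r + GAMMA * values[s2])
               for p, s2, r in transitions[(state, action)])

def state_value(state, actions):
    # V_s = max_a Q_{s,a}
    return max(q_value(state, a) for a in actions)

print(state_value("s0", ["left", "right"]))  # 4.58, as before
```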
2.2 Value iteration (Q-learning) algorithm
Value iteration lets us numerically calculate the values of states and the values of actions for MDPs with known transition probabilities and rewards.
Procedure for values of states:
- Initialize values of all states $V_i$ to some initial value (usually zero or random)
- For every state $s$ in the MDP, perform the Bellman update $V_s \leftarrow \max_a \sum_{s'} p_{a,s\rightarrow s'} (r_{s,a}+\gamma V_{s'})$
- Repeat step 2 for some large number of iterations or until the changes become small enough (a code sketch of this loop follows the list)
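A sketch of this loop, assuming the MDP is given as plain dictionaries (the container names `probs`, `rewards` and the threshold `THETA` are illustrative):
```python
GAMMA = 0.9
THETA = 1e-6  # stop when the largest change in a sweep drops below this

def value_iteration(states, actions, probs, rewards):
    """probs[s][a] is a list of (p, s2) pairs; rewards[(s, a, s2)] is r_{s,a}."""
    V = {s: 0.0 for s in states}              # step 1: initialize all V_s
    while True:
        max_delta = 0.0
        for s in states:                      # step 2: Bellman update for every state
            new_v = max(
                sum(p * (rewards[(s, a, s2)] + GAMMA * V[s2])
                    for p, s2 in probs[s][a])
                for a in actions
            )
            max_delta = max(max_delta, abs(new_v - V[s]))
            V[s] = new_v
        if max_delta < THETA:                 # step 3: repeat until changes are small
            return V
```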
Procedure for values of actions:
- Initialize all $Q_{s,a}$ to zero
- For every state $s$ and every action $a$ in this state, perform the update $Q_{s,a} \leftarrow \sum_{s'} p_{a,s\rightarrow s'} (r_{s,a}+\gamma \max_{a'} Q_{s',a'})$
- Repeat step 2 for some large number of iterations or until the changes become small enough (a code sketch follows the list)
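The same loop for action values; only the update rule changes (again a sketch with the same illustrative containers):
```python
GAMMA = 0.9
THETA = 1e-6

def q_iteration(states, actions, probs, rewards):
    Q = {(s, a): 0.0 for s in states for a in actions}  # step 1: initialize all Q_{s,a}
    while True:
        max_delta = 0.0
        for s in states:
            for a in actions:                           # step 2: update every (s, a)
                new_q = sum(
                    p * (rewards[(s, a, s2)]
                         + GAMMA * max(Q[(s2, a2)] for a2 in actions))
                    for p, s2 in probs[s][a]
                )
                max_delta = max(max_delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if max_delta < THETA:                           # step 3: repeat until converged
            return Q
```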
Limitations:
- The state space must be discrete and small enough to perform multiple iterations over all states (a possible workaround for continuous spaces: discretization).
- We rarely know the transition probabilities for actions and the reward matrix. The Bellman update needs both the reward for every transition and the probability of this transition (solution: use the agent's experience to estimate both unknowns).
2.3 Practice
Three tables (a data-structure sketch follows this list):
- Reward table: a dictionary
  - key: (state, action, target state)
  - value: immediate reward
- Transition table: a dictionary used to count the experienced transitions, and thus to estimate the transition probabilities.
  - key: (state, action)
  - value: another dictionary mapping each target state to the number of times we have seen it
- Value table: a dictionary mapping a state to its calculated value.
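A sketch of the three tables as Python containers (the variable names `rewards`, `transits`, `values` are illustrative, not a fixed API):
```python
import collections

# Reward table: (state, action, target_state) -> immediate reward
rewards = collections.defaultdict(float)

# Transition table: (state, action) -> Counter of target states,
# used to estimate transition probabilities from experience
transits = collections.defaultdict(collections.Counter)

# Value table: state -> current value estimate
values = collections.defaultdict(float)

# Recording one experienced step (s, a, r, s'):
s, a, r, s_next = 0, 1, 0.0, 4
rewards[(s, a, s_next)] = r
transits[(s, a)][s_next] += 1

# Estimated probability of reaching s' after taking a in s:
p_estimate = transits[(s, a)][s_next] / sum(transits[(s, a)].values())
```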
Procedure:
- Play random steps in the environment to populate the reward and transition tables.
- Perform a value iteration loop over all states and update the value table.
- Play several full episodes to check the improvement using the updated value table.
Code example of value iteration:
Main functions in the code:
- play_n_random_steps(count): gathers random experience from the environment and updates the reward and transition tables.
- calc_action_value(state, action): calculates the value of the action from the state, using the transition, reward, and value tables; with it we can select the best action and calculate the new value of the state.
- select_action(state): picks the action with the largest calc_action_value for the given state.
- play_episode(env): plays one full episode using select_action and returns the total reward, to check the improvement.
- value_iteration(): performs one value iteration sweep over all states, updating the value table.
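Below is a sketch of how these functions could fit together as an agent for a Gym-style discrete environment such as FrozenLake. It is reconstructed from the descriptions above, not the original code; it assumes the Gymnasium API (reset() returning (obs, info), step() returning five values), and names such as ENV_NAME and GAMMA are illustrative.
```python
import collections
import gymnasium as gym

ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9

class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state, _ = self.env.reset()
        self.rewards = collections.defaultdict(float)                  # (s, a, s') -> r
        self.transits = collections.defaultdict(collections.Counter)   # (s, a) -> {s': count}
        self.values = collections.defaultdict(float)                   # s -> V(s)

    def play_n_random_steps(self, count):
        # Gather random experience and update the reward and transition tables.
        for _ in range(count):
            action = self.env.action_space.sample()
            new_state, reward, terminated, truncated, _ = self.env.step(action)
            self.rewards[(self.state, action, new_state)] = reward
            self.transits[(self.state, action)][new_state] += 1
            if terminated or truncated:
                self.state, _ = self.env.reset()
            else:
                self.state = new_state

    def calc_action_value(self, state, action):
        # Q(s, a) from the estimated transition probabilities, rewards, and values.
        target_counts = self.transits[(state, action)]
        total = sum(target_counts.values())
        if total == 0:
            return 0.0  # no experience for this (state, action) pair yet
        return sum((count / total) * (self.rewards[(state, action, tgt)]
                                      + GAMMA * self.values[tgt])
                   for tgt, count in target_counts.items())

    def select_action(self, state):
        # Greedy action with respect to the current tables.
        return max(range(self.env.action_space.n),
                   key=lambda a: self.calc_action_value(state, a))

    def play_episode(self, env):
        # Play one full episode with the current value table; return the total reward.
        total_reward = 0.0
        state, _ = env.reset()
        while True:
            action = self.select_action(state)
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                return total_reward

    def value_iteration(self):
        # One Bellman sweep over all states: V(s) <- max_a calc_action_value(s, a).
        for state in range(self.env.observation_space.n):
            self.values[state] = max(self.calc_action_value(state, a)
                                     for a in range(self.env.action_space.n))
```
A typical training loop would alternate agent.play_n_random_steps(...) with agent.value_iteration() and stop once play_episode reaches a satisfactory average reward.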
Code example of Q learning:
Differences from the previous value iteration code (a sketch of the changed pieces follows the list):
- Value table. In value iteration, we keep the value of the state, so the key in the value table is a state. In Q-learning, the Q-function has two parameters, state and action, so the key in the value table is the pair (state, action).
- In Q-learning, we do not need the calc_action_value function, since the action values are stored directly in the value table.
- The value_iteration function. In value iteration, the update takes the maximum over actions to get $V_s$; in Q-learning, we update $Q_{s,a}$ for every (state, action) pair, taking the maximum over actions only in the next state, as in the update of section 2.2.
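A sketch of the pieces that change, reusing the layout of the previous agent sketch (again, the names are assumptions):
```python
import collections

GAMMA = 0.9

class QAgent:
    def __init__(self, env):
        self.env = env
        self.rewards = collections.defaultdict(float)                  # (s, a, s') -> r
        self.transits = collections.defaultdict(collections.Counter)   # (s, a) -> {s': count}
        # Difference 1: the value table is keyed by (state, action), not by state.
        self.values = collections.defaultdict(float)

    # Difference 2: no calc_action_value; selecting an action is a direct table lookup.
    def select_action(self, state):
        return max(range(self.env.action_space.n),
                   key=lambda a: self.values[(state, a)])

    # Difference 3: value_iteration updates Q(s, a) for every state-action pair.
    def value_iteration(self):
        for state in range(self.env.observation_space.n):
            for action in range(self.env.action_space.n):
                target_counts = self.transits[(state, action)]
                total = sum(target_counts.values())
                if total == 0:
                    continue  # no experience for this pair yet
                q = 0.0
                for tgt, count in target_counts.items():
                    best_next = max(self.values[(tgt, a)]
                                    for a in range(self.env.action_space.n))
                    q += (count / total) * (self.rewards[(state, action, tgt)]
                                            + GAMMA * best_next)
                self.values[(state, action)] = q
```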