Going from (4.3) to (4.4) turns the expectation (which depends on following policy $\pi$) into a concrete calculation. To do this, you must resolve the random variables $R_{t+1}$ and $S_{t+1}$ into specific values drawn from the finite sets $\mathcal{R}$ and $\mathcal{S}^+$, whose individual members are denoted $r$ and $s'$.
Recall that $\pi(a|s)$ is the probability of taking action $a$ in state $s$. So for any function $F(s,a)$, the weighted sum $\sum_{a} \pi(a|s)F(s,a)$ gives the expected value of $F(s,a)$ in state $s$ when the action is chosen according to policy $\pi$.
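To make that weighted sum concrete, here is a minimal Python sketch; the dictionary layout and the names `pi`, `F` and `expected_value_under_policy` are illustrative assumptions, not anything defined in the question:

```python
# Expectation over actions of an arbitrary function F(s, a), weighted by pi(a|s).
def expected_value_under_policy(pi, F, s, actions):
    """Return sum_a pi(a|s) * F(s, a), the expected value of F in state s under pi."""
    return sum(pi[s][a] * F(s, a) for a in actions)

# Example: a state with two actions, where pi(a0|s) = 0.7 and pi(a1|s) = 0.3.
pi = {"s": {"a0": 0.7, "a1": 0.3}}
F = lambda s, a: {"a0": 1.0, "a1": 5.0}[a]  # some arbitrary function of (s, a)

print(expected_value_under_policy(pi, F, "s", ["a0", "a1"]))  # 0.7*1.0 + 0.3*5.0 = 2.2
```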
So you can rewrite (4.3) as:
$$v_{\pi}(s) = \sum_{a} \pi(a|s)\,\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1})| S_{t}=s, A_t=a]\qquad\qquad(4.3b)$$
Similarly, now that the expectation is conditioned on both $s$ and $a$ (so each possible action is handled separately), you can use the transition and reward probabilities to resolve the expectation fully, replacing the random variables $R_{t+1}$ and $S_{t+1}$ with the specific values whose probabilities the MDP model $p(s',r|s,a)$ describes. This leads to equation (4.4):
$$v_{\pi}(s) = \sum_{a} \pi(a|s)\sum_{s',r} p(s',r|s,a)\,\mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1})| S_{t+1}=s', R_{t+1}=r]\qquad\qquad(4.3c)$$
$$\qquad = \sum_{a} \pi(a|s)\sum_{s',r} p(s',r|s,a)(r + \gamma v_{\pi}(s'))\qquad\qquad(4.4)$$
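As a concrete illustration of (4.4), here is a minimal Python sketch of a single Bellman backup for one state. The data layout, with the dynamics stored as `p[(s, a)]` mapping to a list of `(s_next, r, prob)` triples and the current value estimates in a dictionary `v`, is an assumption made for this example, not part of the derivation:

```python
# One application of the right-hand side of (4.4) for a single state s.
def bellman_backup(s, pi, p, v, gamma):
    """Return sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * (r + gamma * v[s'])."""
    total = 0.0
    for a, prob_a in pi[s].items():          # outer sum over actions, weighted by pi(a|s)
        for s_next, r, prob in p[(s, a)]:    # inner sum over (s', r), weighted by p(s',r|s,a)
            total += prob_a * prob * (r + gamma * v[s_next])
    return total

# Tiny made-up example: from "s", "stay" usually loops back, "go" always terminates.
pi = {"s": {"stay": 0.5, "go": 0.5}}
p = {
    ("s", "stay"): [("s", 0.0, 0.9), ("T", 1.0, 0.1)],
    ("s", "go"):   [("T", 2.0, 1.0)],
}
v = {"s": 0.0, "T": 0.0}
print(bellman_backup("s", pi, p, v, gamma=0.9))  # 0.5*0.1*1.0 + 0.5*1.0*2.0 = 1.05
```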
Your last equation is a slightly different formulation of the same result, where:
* $R_s^a$ is the expected reward when taking action $a$ in state $s$. It is equal to $\sum_{s',r} p(s',r|s,a)\,r$, and also, by definition and independently of $\pi$, to $\mathbb{E}[R_{t+1}|S_t=s, A_t=a]$.
* $P_{ss'}^a$ is the state transition probability, i.e. the probability of ending up in state $s'$ after taking action $a$ in state $s$. It is equal to $\sum_r p(s',r|s,a)$, the joint probability summed over all possible rewards for that particular $s'$.
Technically, using $R_s^a$ loses some information about the dynamics of the underlying MDP: you keep only the expected reward, not the full reward distribution. However, nothing important to reinforcement learning is lost, since it deals with maximising expected reward, so in some ways it is more convenient to characterise the MDP directly in terms of expected rewards.
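As a sketch of the conversion between the two formulations (reusing the assumed `p[(s, a)]` layout from the previous example), the snippet below builds $R_s^a$ and $P_{ss'}^a$ from $p(s',r|s,a)$ and checks that the backup written with them matches the (4.4) form; only the expected reward survives the conversion, which is the information loss mentioned above:

```python
from collections import defaultdict

def to_expected_reward_form(p):
    """Build R[(s, a)] = sum_{s',r} p(s',r|s,a)*r and P[(s, a, s')] = sum_r p(s',r|s,a)."""
    R = defaultdict(float)
    P = defaultdict(float)
    for (s, a), transitions in p.items():
        for s_next, r, prob in transitions:
            R[(s, a)] += prob * r       # expected reward: the reward distribution is discarded
            P[(s, a, s_next)] += prob   # transition probability, marginalised over rewards
    return R, P

# Same made-up dynamics as in the previous sketch.
p = {
    ("s", "stay"): [("s", 0.0, 0.9), ("T", 1.0, 0.1)],
    ("s", "go"):   [("T", 2.0, 1.0)],
}
R, P = to_expected_reward_form(p)
print(R[("s", "stay")])       # 0.1 -- only the mean reward is kept
print(P[("s", "stay", "s")])  # 0.9

# The backup written with R and P gives the same value as the (4.4) form (1.05).
pi = {"s": {"stay": 0.5, "go": 0.5}}
v = {"s": 0.0, "T": 0.0}
gamma = 0.9
backup = sum(
    prob_a * (R[("s", a)] + gamma * sum(P[("s", a, s2)] * v[s2] for s2 in v))
    for a, prob_a in pi["s"].items()
)
print(backup)  # 1.05
```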