
I am studying the policy gradient through this article: https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f

I can't figure out how the first equation becomes the second equation.

In the second equation, why does the first expectation have only $s_0, a_0, s_1, a_1, \ldots, s_t, a_t$ but no rewards involved? Also, why does the second expectation have only $r_{t+1}, s_{t+1}, \ldots, r_T, s_T$, but no actions involved? Could anyone please share the thoughts/intuition behind this? Thank you!

[image: the two equations from the article]
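
As best I can transcribe them (the article's exact notation may differ slightly), the first equation is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t$ is the return from time $t$, and the second equation is

$$\nabla_\theta J(\theta) = \mathbb{E}_{s_0, a_0, \ldots, s_t, a_t}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \mathbb{E}_{r_{t+1}, s_{t+1}, \ldots, r_T, s_T}\left[G_t\right]$$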

Edamame

1 Answer


The Medium post has brackets in the wrong place . . . the second expectation must be inside the sum to make sense, otherwise $t$ is not defined. You can see a couple of steps later that $Q(s_t, a_t)$ gets magically moved back inside the sum.

I cannot see a way to fix the first expectation, though, since it uses $t$ before it is defined!
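
For reference, a well-formed version of what the article seems to be aiming for (my reconstruction, not the article's exact notation) keeps the conditional expectation inside the sum:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \mathbb{E}_{r_{t+1}, s_{t+1}, \ldots, r_T, s_T}\big[G_t \mid s_t, a_t\big]\right]$$

Here the outer expectation has to be over the whole trajectory (or at least over all the states and actions), because $t$ ranges over the sum. The inner conditional expectation is, by definition, $Q^{\pi_\theta}(s_t, a_t)$, which is why $Q(s_t, a_t)$ can be substituted back inside the sum.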

The expectations are trying to show which parts of the distribution of trajectories each calculation explicitly depends upon. I don't think you need to read into it anything more than that.
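If it helps, the underlying step the article appears to be reaching for is the law of total expectation applied term by term: for each $t$,

$$\mathbb{E}_{\tau}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big] = \mathbb{E}_{s_0, a_0, \ldots, s_t, a_t}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \mathbb{E}_{r_{t+1}, s_{t+1}, \ldots, r_T, s_T}\big[G_t \mid s_t, a_t\big]\Big]$$

No rewards appear in the outer expectation because the log-policy term does not depend on them, and no actions appear in the inner one because $G_t$ is a sum of rewards only (the future actions are averaged out under the policy).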

However, I would suggest you find another article which does not have these errors. There are similar derivations of the policy gradient in Sutton and Barto's *Reinforcement Learning: An Introduction*.

Neil Slater