
I am studying the policy gradient through this article: https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f

I can't figure out how the first equation becomes the second equation.

In the second equation, why does the first expectation have only $s_0, a_0, s_1, a_1, \ldots, s_t, a_t$ but no rewards involved? Also, why does the second expectation have only $r_{t+1}, s_{t+1}, \ldots, r_T, s_T$, but no actions involved? Could anyone please share the thoughts/intuition behind this? Thank you!

[image: the two equations from the article]
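
As best I can transcribe them (the article's exact notation may differ slightly), the first equation is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t$ is the return from time $t$, and the second equation is

$$\nabla_\theta J(\theta) = \mathbb{E}_{s_0, a_0, \ldots, s_t, a_t}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \mathbb{E}_{r_{t+1}, s_{t+1}, \ldots, r_T, s_T}\left[G_t\right]$$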

Edamame

1 Answer


The Medium post has brackets in the wrong place . . . the second expectation must be inside the sum to make sense, otherwise $t$ is not defined. You can see a couple of steps later that $Q(s_t, a_t)$ gets magically moved back inside the sum.

I cannot see a way to fix the first expectation, though, since it uses $t$ before it is defined!
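
For reference, a well-formed version of what the article seems to be aiming for (my reconstruction, not the article's exact notation) keeps the conditional expectation inside the sum:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \mathbb{E}_{r_{t+1}, s_{t+1}, \ldots, r_T, s_T}\big[G_t \mid s_t, a_t\big]\right]$$

Here the outer expectation has to be over the whole trajectory (or at least over all the states and actions), because $t$ ranges over the sum. The inner conditional expectation is, by definition, $Q^{\pi_\theta}(s_t, a_t)$, which is why $Q(s_t, a_t)$ can be substituted back inside the sum.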

The expectations are trying to show which parts of the distribution of trajectories each calculation explicitly depends upon. I don't think you need to read into it anything more than that.
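If it helps, the underlying step the article appears to be reaching for is the law of total expectation applied term by term: for each $t$,

$$\mathbb{E}_{\tau}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big] = \mathbb{E}_{s_0, a_0, \ldots, s_t, a_t}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\; \mathbb{E}_{r_{t+1}, s_{t+1}, \ldots, r_T, s_T}\big[G_t \mid s_t, a_t\big]\Big]$$

No rewards appear in the outer expectation because the log-policy term does not depend on them, and no actions appear in the inner one because $G_t$ is a sum of rewards only (the future actions are averaged out under the policy).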

However, I would suggest you find another article which does not have these errors. There are similar derivations of the policy gradient in Sutton and Barto's *Reinforcement Learning: An Introduction*.

Neil Slater