Reinforcement learning: Discounting rewards in the REINFORCE algorithm

Question

I am looking into the REINFORCE algorithm for reinforcement learning. I am having trouble understanding how rewards should be computed.

The algorithm from Sutton & Barto:

What does G, 'return from step t' mean here?

1. Return from step t to step T-1, i.e. R_t + R_(t+1) + ... + R_(T-1)?
2. Return from step 0 to step t, i.e. R_0 + R_1 + ... + R_t?

score 3 · Accepted Answer · edited Jun 16 '20 at 11:08

What does G, 'return from step t' mean here?

1. Return from step t to step T-1, i.e. R_t + R_(t+1) + ... + R_(T-1)?
2. Return from step 0 to step t, i.e. R_0 + R_1 + ... + R_t?

Neither, but (1) is closest.

$$G_t = \sum_{i=t+1}^T R_i$$

i.e. the undiscounted sum of all rewards from step $t+1$ to step $T$.

The confusion may come from the REINFORCE loop running over $t = 0, \dots, T-1$. That range makes sense because of the one-step offset between the index of the return and the rewards it sums: $G_t$ starts at $R_{t+1}$, not $R_t$. So $G_{T-1} = R_T$, and $G_T = 0$ always (no future reward is possible at the end of the episode).
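
As a quick check of the indexing, take a hypothetical episode (the numbers are made up for illustration) with $T = 3$ and rewards $R_1 = 1$, $R_2 = 2$, $R_3 = 3$. The definition above then gives:

$$G_0 = R_1 + R_2 + R_3 = 6, \quad G_1 = R_2 + R_3 = 5, \quad G_2 = R_3 = 3, \quad G_3 = 0$$

The loop visits $t = 0, 1, 2$, and each $G_t$ only contains rewards received after step $t$.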

score 2 · Answer 2 · edited Aug 20 '19 at 12:55


The latest version of the book defines $G$ explicitly. Consistent with Neil Slater's answer, "$G_t \leftarrow$ return from step $t$" means:

$$ G_t = \sum_{k=t+1}^T \gamma^{k-t-1}R_k $$
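
With $\gamma = 1$ this reduces to the undiscounted sum in the accepted answer. The formula also implies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ with $G_T = 0$, so all returns of an episode can be computed in one backward pass. Here is a minimal sketch of that, not the book's own code: the function name `discounted_returns` and the convention that `rewards[k]` stores $R_{k+1}$ are assumptions made for this example.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}] for a single episode.

    Assumed convention: rewards[k] holds R_{k+1}, the reward received
    after the action taken at step k, so len(rewards) == T.
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T = 0: no reward is possible after the episode ends
    for t in reversed(range(len(rewards))):
        # Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(discounted_returns([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
print(discounted_returns([1.0, 2.0, 3.0], gamma=1.0))  # [6.0, 5.0, 3.0]
```

The backward pass costs one multiplication and one addition per step, instead of re-summing the remaining rewards for every $t$.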

answered Aug 20 '19 at 11:56 by Roberto Vázquez Lucerga · edited Aug 20 '19 at 12:55 by Stephen Rauch
