I have a continuous reinforcement learning problem that I solve with policy gradients, and I use a baseline to reduce the variance of the gradient estimates.
The baseline I use is the moving average of the rewards obtained over the last 10 time steps before the current time step.
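To make the setup concrete, here is a minimal sketch of what I mean (the names `policy_gradient_step`, `log_prob_grad`, and the learning rate are placeholders for my actual network and update code):

```python
from collections import deque

import numpy as np

WINDOW = 10                      # moving-average window for the baseline
recent_rewards = deque(maxlen=WINDOW)

def baseline():
    # Average of the rewards from the last WINDOW time steps (0 before any reward is seen).
    return float(np.mean(recent_rewards)) if recent_rewards else 0.0

def policy_gradient_step(log_prob_grad, reward, learning_rate=1e-3):
    # Advantage = reward minus the moving-average baseline; the baseline is
    # subtracted only to reduce variance of the gradient estimate.
    advantage = reward - baseline()
    recent_rewards.append(reward)
    # REINFORCE-style update direction: grad log pi(a|s) * advantage
    # (applying it to the parameters is elided here).
    return learning_rate * advantage * log_prob_grad
```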
I have two questions regarding the baseline:
- Would it be better to use a frozen baseline for some time steps, e.g., keeping the baseline constant for the next 5 time steps? My motivation comes from the frozen target Q-values in DQN. In expectation, the baseline does not change the gradient, but since it is subtracted from the reward it can determine the direction of individual gradient updates, so could it somehow help stabilize the training of the network and improve the results? (A sketch of what I mean is given after this list.)
- Is there a rule of thumb for selecting the moving-average window, e.g., a ratio based on the total number of training epochs?
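For the first question, this is roughly the variant I have in mind, building on the sketch above (`FREEZE_STEPS` and the helper names are placeholders I chose for illustration, not an established method):

```python
FREEZE_STEPS = 5      # how long the baseline is held constant
frozen_baseline = 0.0
step_counter = 0

def frozen_baseline_value():
    # Refresh the baseline from the current moving average only every
    # FREEZE_STEPS time steps and hold it constant in between,
    # analogous to the periodically updated target network in DQN.
    global frozen_baseline, step_counter
    if step_counter % FREEZE_STEPS == 0:
        frozen_baseline = baseline()
    step_counter += 1
    return frozen_baseline
```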