
How do I apply REINFORCE / policy-gradient algorithms to a continuous action space? I have learned that one of the advantages of policy gradients is that they are applicable to continuous action spaces. One approach I can think of is discretizing the action space, the same way we do for DQN. Should we follow the same method for policy-gradient algorithms as well, or is there another way this is done?

Thanks

chink
  • I'd recommend checking out this related question/answer: https://datascience.stackexchange.com/a/25212/75152 That answer contains a good description of how policy gradient can be applied to a continuous action space. – zachdj Oct 14 '19 at 13:10
  • Hi, thanks for the link. I had a little difficulty understanding some parts, so what I want to confirm is: even in policy gradients for continuous actions, do we discretize them into discrete actions? Is my understanding correct? – chink Oct 14 '19 at 14:06
  • No that's not correct - there's no need to discretize the action space with policy gradient. For discrete problems, the policy returns a vector of probabilities for each action. But in continuous problems, the policy returns a _continuous_ distribution over the action space. – zachdj Oct 14 '19 at 14:31
  • okay, thanks!! @zachdj – chink Oct 14 '19 at 15:18

1 Answer


Yes, that is possible. It can be done in the following way:

We assume that the action distribution is Gaussian, i.e. that we need to learn the parameters $\theta$ of $\mathcal{N}(a \mid \mu_\theta, \sigma_\theta)$. Let's say that $\theta$ is given by the weights of a neural network, which we find by maximizing the objective $$\max_\theta \mathbb{E}_{p_{\theta}}\left[ R(s,a)\right],$$ where the actions are sampled from $p_\theta(a \mid s) = \mathcal{N}(a \mid \mu_\theta, \sigma_\theta)$ and $R(s,a)$ is the cumulative discounted reward. By the policy gradient theorem, the gradient is then simply $\mathbb{E}_{p_{\theta}}\left[ R(s,a)\, \nabla_\theta \log p_\theta(a \mid s) \right]$.
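For concreteness, here is a minimal sketch of that gradient estimate in PyTorch; the specific numbers for the mean, log-sigma and return are made up for illustration:

```python
# Minimal sketch: the REINFORCE gradient for a Gaussian policy is obtained by
# differentiating -R * log N(a | mu, sigma) with respect to the policy parameters.
import torch
from torch.distributions import Normal

mu = torch.tensor([0.3], requires_grad=True)          # policy mean (learnable)
log_sigma = torch.tensor([-0.5], requires_grad=True)  # log of the std (learnable)

dist = Normal(mu, log_sigma.exp())
action = dist.sample()   # a ~ N(mu, sigma); sampling itself carries no gradient
R = 1.7                  # cumulative discounted return of the rollout (placeholder)

loss = -(R * dist.log_prob(action)).sum()  # minimizing this ascends the objective
loss.backward()                            # mu.grad, log_sigma.grad hold the estimate
```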

In practice, we design a neural network that outputs one $\mu$ per action dimension, and $\sigma$ can either be learned or kept fixed. If learned, we interpret the network's output as $\log \sigma$, so it can take any value (in particular, become negative) while $\sigma = \exp(\log \sigma)$ stays positive. To sample an action, we simply draw from the Gaussian defined by the network's outputs.
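Below is a sketch of such a network, again assuming PyTorch; the class name `GaussianPolicy`, the hidden size, and the observation/action dimensions are illustrative choices, not part of the original answer:

```python
# A Gaussian policy head: one mean per action dimension and a
# state-independent learnable log-sigma.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),          # one mu per action dimension
        )
        # interpreted as log(sigma), so sigma = exp(log_sigma) > 0
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mu = self.mu_net(obs)
        return Normal(mu, self.log_sigma.exp())

# Sampling a continuous action for a (hypothetical) 4-dimensional observation:
policy = GaussianPolicy(obs_dim=4, act_dim=2)
dist = policy(torch.randn(4))
action = dist.sample()                  # continuous action, no discretization
log_prob = dist.log_prob(action).sum()  # used in the REINFORCE update sketched above
```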

See e.g. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. “Continuous control with deep reinforcement learning,” International Conference on Learning Representations, 2016.

hh32
  • Thank you for your answer. It's really helpful. But I'm not sure what you mean by "en value": 'In practice we design a neural network to output one μ per action dimension and σ can either be learned or kept fixed. If learned, interpret the output as logσ, so it can take en value.' – Eman.suradi Jan 14 '22 at 20:22
  • It was an error. I fixed it. – hh32 Jan 21 '22 at 13:30