Questions tagged [policy-gradients]

60 questions
23
votes
2 answers

Formal proof of vanilla policy gradient convergence

So I stumbled upon this question, where the author asks for a proof of convergence for vanilla policy gradient procedures. The answer provided points to some literature, but the formal proof itself is not included. Looking at Sutton & Barto, Reinforcement…
6
votes
1 answer

Reinforcement Learning: Policy Gradient derivation question

I have been reading this excellent post: https://medium.com/@jonathan_hui/rl-policy-gradients-explained-9b13b688b146 and following the RL-videos by David Silver, and I did not get this thing: For $\pi_\theta(\tau) = \pi_\theta(s_1, a_1, ..., s_T,…
Hadamard
5
votes
1 answer

RL's policy gradient (REINFORCE) pipeline clarification

I am trying to build a policy gradient RL machine. Let's look at REINFORCE's equation for updating the model parameters by taking a gradient ascent step (I apologize if the notation is slightly non-conventional): $$\omega = \omega + \alpha…
Alexey Burnakov
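The truncated update in the excerpt above is the standard REINFORCE ascent step, $\omega \leftarrow \omega + \alpha \, \nabla_\omega \log \pi_\omega(a \mid s) \, G$. A minimal sketch on a toy two-armed bandit (the bandit setup, reward means, and learning rate are illustrative assumptions, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Toy 2-armed bandit: arm 1 pays more on average, so REINFORCE
# should push its probability up over time.
true_means = np.array([0.0, 1.0])
omega = np.zeros(2)        # policy parameters (logits)
alpha = 0.1                # learning rate

for _ in range(2000):
    probs = softmax(omega)
    a = rng.choice(2, p=probs)
    G = true_means[a] + rng.normal()     # sampled (noisy) return
    # Gradient of log softmax w.r.t. the logits: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    omega += alpha * grad_log_pi * G     # omega <- omega + alpha * grad * G

print(softmax(omega))                    # arm 1's probability should dominate
```

The same update generalizes to sequential problems by replacing the bandit return with the episode return from each visited state.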
5
votes
1 answer

Policy Gradients - do log-probability gradients favor less likely actions?

Assume we work with neural networks, with the policy gradients method. The gradient of the objective function $J$ is an expectation. In other words, to get this gradient $\nabla_{\theta} J(\theta)$, we sample N trajectories, then average out…
Kari
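The excerpt above describes the score-function (likelihood-ratio) estimator: $\nabla_\theta J(\theta)$ is an expectation, so it is approximated by averaging $\nabla_\theta \log \pi_\theta \cdot R$ over N samples. A sketch comparing the Monte Carlo average against the exact gradient for a softmax policy on a toy one-step problem (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# One-step "trajectories" on a toy 2-armed bandit.
mu = np.array([0.0, 1.0])      # true mean reward per action
theta = np.array([0.2, -0.1])  # policy logits
probs = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) * mu_a for a softmax policy:
# dJ/dtheta_i = pi_i * (mu_i - J)
J = probs @ mu
exact_grad = probs * (mu - J)

# Monte Carlo estimate: average grad log pi(a) * R over N samples.
N = 200_000
a = rng.choice(2, size=N, p=probs)
R = mu[a] + rng.normal(size=N)            # noisy sampled returns
grad_log_pi = np.eye(2)[a] - probs        # per-sample score function
mc_grad = (grad_log_pi * R[:, None]).mean(axis=0)

print(exact_grad, mc_grad)                # the two should be close
```

With a small N the estimate is unbiased but very noisy, which is exactly why baselines and large batches are used in practice.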
4
votes
2 answers

Agent always takes the same action in DQN - Reinforcement Learning

I have trained an RL agent using the DQN algorithm. After 20000 episodes my rewards have converged. Now when I test this agent, it always takes the same action, irrespective of state. I find this very weird. Can someone help me with this? Is…
chink
4
votes
2 answers

Reinforcement learning: Discounting rewards in the REINFORCE algorithm

I am looking into the REINFORCE algorithm for reinforcement learning. I am having trouble understanding how rewards should be computed. The algorithm from Sutton & Barto: What does G, the 'return from step t', mean here? Return from step t to step T-1,…
Atuos
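In Sutton & Barto's REINFORCE, $G_t$ is the discounted return from step $t$ to the end of the episode, $G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} r_k$. A minimal sketch of the usual backward-recursion computation (the function name is mine):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... computed backwards in O(T)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(discounted_returns([1.0, 2.0, 3.0], gamma=0.5))  # [2.75 3.5  3.  ]
```

Each episode thus yields one $G_t$ per time step, and REINFORCE weights the log-probability gradient at step $t$ by that $G_t$.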
4
votes
1 answer

Time horizon T in policy gradients (actor-critic)

I am currently going through the Berkeley lectures on Reinforcement Learning. Specifically, I am at slide 5 of this lecture. At the bottom of that slide, the gradient of the expected sum of rewards function is given by $$ \nabla J(\theta) =…
4
votes
1 answer

Policy-based RL methods - what do continuous actions look like?

I've read several times that Policy-based RL methods can work with a continuous action space (move left 5 meters, move right 5.5312 meters), rather than with discrete actions like Value-based methods (Q-learning). If Policy-based methods produce…
Kari
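In practice, a continuous-action policy network outputs the parameters of a distribution (commonly the mean and standard deviation of a Gaussian), and the action is a sample from it. A hedged sketch with a stand-in linear "network" (the weights and state below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for a policy network: a linear map from state to the Gaussian mean,
# plus a learned log standard deviation (kept in log space so std stays > 0).
W_mu = np.array([[0.5], [-0.3]])   # hypothetical weights: state -> mean
log_std = np.array([-1.0])         # hypothetical learned log std

def act(state):
    mu = state @ W_mu              # mean action, shape (1,)
    std = np.exp(log_std)
    action = rng.normal(mu, std)   # sample, e.g. "move 5.5312 meters"
    return float(action)

print(act(np.array([1.0, 2.0])))   # a real-valued action, not an action index
```

So the "continuous action" is simply a real number (or vector) sampled from the distribution the network parameterizes, often squashed with tanh to respect action bounds.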
4
votes
1 answer

How does action get selected in a Policy Gradient Method?

As I understand it, in Reinforcement Learning a big difference between a Value-based method and a Policy-gradient method is how the next action is selected. In Q-learning (a Value-based method), each possible action gets a score. We then select the next…
Kari
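The contrast the question draws is: value-based methods pick the argmax over Q-values (plus an exploration scheme such as epsilon-greedy), while policy-gradient methods *sample* from the distribution the policy network outputs. A small illustration (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

q_values = np.array([1.0, 2.0, 0.5])       # value-based: a score per action
policy_logits = np.array([1.0, 2.0, 0.5])  # policy-based: action preferences

# Value-based (Q-learning): take the highest-scoring action.
greedy_action = int(np.argmax(q_values))

# Policy-gradient: the logits define a softmax distribution; sample from it,
# so exploration is built into the stochastic policy itself.
probs = np.exp(policy_logits - policy_logits.max())
probs /= probs.sum()
sampled_action = int(rng.choice(len(probs), p=probs))

print(greedy_action, probs, sampled_action)
```

Because the action is sampled, a policy-gradient agent can keep exploring without a separate epsilon schedule, and the same sampling step extends naturally to continuous distributions.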
3
votes
1 answer

Which Policy Gradient Method was used by Google's DeepMind to teach an AI to walk?

I just saw this video on YouTube. Which Policy Gradient method was used to train the AI to walk? Was it DDPG, D4PG, or something else?
3
votes
1 answer

Policy Gradient not "learning"

I'm attempting to implement the policy gradient taken from the "Hands-On Machine Learning" book by Géron, which can be found here. The notebook uses TensorFlow and I'm attempting to do it with PyTorch. My models look as follows: model =…
3
votes
1 answer

Maximum Entropy Policy Gradient Derivation

I am reading through the paper Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review by Sergey Levine. I am having difficulty understanding this part of the derivation of Maximum Entropy Policy Gradients (Section…
3
votes
1 answer

Reinforcement learning: decomposing a policy gradient

I am studying the policy gradient through the website: https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f I couldn't figure out how the first equation becomes the second equation. In the second equation, why the first…
Edamame
3
votes
1 answer

Policy Gradients vs Value function, when implemented via DQN

After studying Q-learning, Sarsa & DQN, I've now discovered the term "Policy Gradients". It's a bit unclear to me how it differs from the above approaches. Here is my understanding, please correct it: From the moment I first encountered DQN, I always…
Kari
2
votes
1 answer

Policy Gradient with continuous action space

How do I apply REINFORCE/policy-gradient algorithms to a continuous action space? I have learnt that one of the advantages of policy gradients is that they are applicable to continuous action spaces. One way I can think of is discretizing the action space…
chink
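Discretization, as the question suggests, is one workaround (though it scales poorly with action dimension; a Gaussian policy is the more common choice). A sketch of binning a 1-D action range (the range and bin count are arbitrary choices):

```python
import numpy as np

# Discretize a continuous 1-D action range into evenly spaced bins and treat
# the problem as discrete. More bins give finer control but a larger action set.
low, high, n_bins = -2.0, 2.0, 9
bin_centers = np.linspace(low, high, n_bins)   # [-2.0, -1.5, ..., 1.5, 2.0]

def to_discrete(a):
    """Map a continuous action to the index of the nearest bin center."""
    return int(np.argmin(np.abs(bin_centers - a)))

print(to_discrete(0.6), bin_centers[to_discrete(0.6)])  # 5 0.5
```

The agent then learns a categorical policy over the bin indices, and `bin_centers[index]` is executed in the environment.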