Questions tagged [policy-gradients]
60 questions
23
votes
2 answers
Formal proof of vanilla policy gradient convergence
So I stumbled upon this question, where the author asks for a proof of the vanilla policy gradient procedure. The answer provided points to some literature, but a formal proof is nowhere to be found. Looking at Sutton & Barto, Reinforcement…
Markus Peschl
- 280
- 1
- 7
6
votes
1 answer
Reinforcement Learning: Policy Gradient derivation question
I have been reading this excellent post: https://medium.com/@jonathan_hui/rl-policy-gradients-explained-9b13b688b146 and following the RL videos by David Silver, and there is one thing I did not get:
For $\pi_\theta(\tau) = \pi_\theta(s_1, a_1, ..., s_T,…
Hadamard
- 63
- 4
5
votes
1 answer
RL's policy gradient (REINFORCE) pipeline clarification
I am trying to build a policy gradient RL machine. Let's look at REINFORCE's equation for updating the model parameters by gradient ascent (I apologize if the notation is slightly non-conventional):
$$\omega = \omega + \alpha…
Alexey Burnakov
- 233
- 2
- 11
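The truncated update above is presumably the standard REINFORCE gradient-ascent step, $\omega \leftarrow \omega + \alpha \, \nabla_\omega \log \pi_\omega(a \mid s) \, G$. A minimal NumPy sketch of that step, assuming a linear softmax policy (all function names here are illustrative, not from the question):

```python
import numpy as np

def softmax_policy(omega, state):
    """Action probabilities pi(a | s) for a linear softmax policy."""
    logits = omega @ state               # one logit per action
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

def grad_log_pi(omega, state, action):
    """Gradient of log pi(a | s) w.r.t. omega for the softmax policy."""
    probs = softmax_policy(omega, state)
    grad = -np.outer(probs, state)       # -pi(b | s) * s for every action b
    grad[action] += state                # +s for the action actually taken
    return grad

def reinforce_update(omega, state, action, G, alpha=0.01):
    """One REINFORCE step: omega + alpha * grad log pi(a | s) * G."""
    return omega + alpha * grad_log_pi(omega, state, action) * G
```

With a positive return `G`, the update raises the probability of the action that was taken; with a negative return, it lowers it.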
5
votes
1 answer
Policy Gradients - do gradient log probabilities favor less likely actions?
Assume we work with neural networks and the policy gradient method. The gradient of the objective function $J$ is an expectation.
In other words, to get this gradient $\nabla_{\theta} J(\theta)$, we sample $N$ trajectories, then average out…
Kari
- 2,686
- 1
- 17
- 47
4
votes
2 answers
Agent always takes a same action in DQN - Reinforcement Learning
I have trained an RL agent using the DQN algorithm. After 20000 episodes my rewards have converged. Now when I test this agent, it always takes the same action, irrespective of the state. I find this very weird. Can someone help me with this? Is…
chink
- 555
- 9
- 17
4
votes
2 answers
Reinforcement learning: Discounting rewards in the REINFORCE algorithm
I am looking into the REINFORCE algorithm for reinforcement learning. I am having trouble understanding how rewards should be computed.
The algorithm from Sutton & Barto:
What does $G$, the 'return from step $t$', mean here?
The return from step $t$ to step $T-1$,…
Atuos
- 317
- 1
- 2
- 7
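The return $G_t$ asked about here is the discounted sum of rewards from step $t$ to the end of the episode, $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1}$, as defined in Sutton & Barto. It can be computed for every step in one backward pass; a minimal sketch (names are illustrative):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every time step t, given episode rewards R_1..R_T."""
    G = np.zeros(len(rewards))
    running = 0.0
    # Walk backward: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(discounted_returns([1, 1, 1], gamma=0.5))  # [1.75 1.5  1.  ]
```

The backward recursion avoids recomputing the discounted sum from scratch at each step, turning an $O(T^2)$ computation into $O(T)$.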
4
votes
1 answer
Time horizon T in policy gradients (actor-critic)
I am currently going through the Berkeley lectures on Reinforcement Learning. Specifically, I am at slide 5 of this lecture.
At the bottom of that slide, the gradient of the expected sum of rewards function is given by
$$
\nabla J(\theta) =…
Dummie Variable
- 86
- 2
4
votes
1 answer
Policy-based RL methods - what do continuous actions look like?
I've read several times that policy-based RL methods can work with a continuous action space (move left 5 meters, move right 5.5312 meters), rather than with discrete actions like value-based methods (Q-learning).
If Policy-based methods produce…
Kari
- 2,686
- 1
- 17
- 47
4
votes
1 answer
How does action get selected in a Policy Gradient Method?
As I understand it, in reinforcement learning a big difference between a value-based method and a policy-gradient method is how the next action is selected.
In Q-learning (a value-based method), each possible action gets a score. We then select the next…
Kari
- 2,686
- 1
- 17
- 47
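The contrast described in this question can be sketched concretely: a value-based method picks the argmax of its action scores, while a policy-gradient method samples directly from the probabilities its policy outputs. A minimal illustration (names are my own, not from the question):

```python
import numpy as np

def select_action_q(q_values):
    """Value-based (greedy): pick the action with the highest score."""
    return int(np.argmax(q_values))

def select_action_policy(probs, rng):
    """Policy gradient: sample an action from pi(a | s) directly."""
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
print(select_action_q([0.1, 0.9, 0.3]))                # always 1
print(select_action_policy([0.1, 0.8, 0.1], rng))      # usually 1, sometimes 0 or 2
```

Sampling means the stochastic policy explores on its own, whereas greedy Q-learning typically needs an extra mechanism such as epsilon-greedy.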
3
votes
1 answer
Which Policy Gradient method was used by Google's DeepMind to teach an AI to walk
I just saw this video on YouTube.
Which Policy Gradient method was used to train the AI to walk?
Was it DDPG or D4PG or what?
learner
- 33
- 2
3
votes
1 answer
Policy Gradient not "learning"
I'm attempting to implement the policy gradient taken from the "Hands-On Machine Learning" book by Géron, which can be found here. The notebook uses TensorFlow and I'm attempting to do it with PyTorch.
My models look as follows:
model =…
Harpal
- 903
- 1
- 7
- 13
3
votes
1 answer
Maximum Entropy Policy Gradient Derivation
I am reading through the paper Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review by Sergey Levine. I am having difficulty understanding this part of the derivation on Maximum Entropy Policy Gradients (Section…
Ricky Sanjaya
- 39
- 3
3
votes
1 answer
Reinforcement learning: decomposing a policy gradient
I am studying the policy gradient through the website: https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f
I couldn't figure out how the first equation becomes the second one.
In the second equation, why the first…
Edamame
- 2,705
- 5
- 23
- 32
3
votes
1 answer
Policy Gradients vs Value function, when implemented via DQN
After studying Q-learning, SARSA & DQN, I've now discovered the term "Policy Gradients".
It's a bit unclear to me how it differs from the above approaches. Here is my understanding; please correct it:
From the moment I first encountered DQN, I always…
Kari
- 2,686
- 1
- 17
- 47
2
votes
1 answer
Policy Gradient with continuous action space
How do I apply REINFORCE/policy-gradient algorithms to a continuous action space? I have learnt that one of the advantages of policy gradients is that they are applicable to continuous action spaces. One way I can think of is discretizing the action space…
chink
- 555
- 9
- 17
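A common alternative to discretization, and a standard way policy gradients handle continuous actions, is a Gaussian policy: the parameterized policy outputs a mean (and a standard deviation), and the action is sampled from that distribution. A minimal sketch with a linear mean and a fixed log-std (all names are illustrative):

```python
import numpy as np

def gaussian_policy(theta, state, log_std=-0.5, rng=None):
    """Sample a continuous action a ~ N(mu(s), sigma) and return (a, log pi(a | s))."""
    if rng is None:
        rng = np.random.default_rng()
    mu = theta @ state                    # linear mean, e.g. "move 5.5312 meters"
    sigma = np.exp(log_std)
    action = rng.normal(mu, sigma)
    # Log-density of the univariate Gaussian, used in the REINFORCE update
    log_prob = (-0.5 * ((action - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return action, log_prob
```

Since the sampled action is a real number, no discretization is needed, and the same $\nabla_\theta \log \pi_\theta(a \mid s)$ machinery as in the discrete case applies to the Gaussian log-density.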