What is the proof of this? Can someone point me to a reference?
It is called the "Policy Gradient Theorem", and a good reference is Sutton & Barto, Reinforcement Learning: An Introduction. In the second edition, the theory behind REINFORCE and Actor-Critic is covered in Chapter 13.
In brief, the proof shows that a particular sampled expression, built from the observed return and the gradient of the log of the policy with respect to its parameters, is (in expectation, up to a constant) the gradient of the policy's performance with respect to those parameters. It follows that a step in the direction of this sampled value is, in expectation, a step in the direction that increases the expected sum of discounted rewards (or the average expected reward) associated with that policy.
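For reference, in Sutton & Barto's notation the episodic form of the theorem can be written, up to a constant of proportionality, as

$$\nabla J(\theta) \;\propto\; \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta) \;=\; \mathbb{E}_\pi\!\left[ G_t \, \nabla \ln \pi(A_t \mid S_t, \theta) \right],$$

where $\mu$ is the on-policy state distribution, $q_\pi$ is the action-value function and $G_t$ is the return. The bracketed quantity on the right is exactly what REINFORCE samples and uses as its update direction.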
What is the best convergence rate out there that's known (if any are proven)?
This is not something that can be proven analytically, except in situations where you already know the loss function expressed directly in terms of the policy parameters (and nothing else that depends on them). Even for simple toy environments this is not really possible. The usual approach is to compare different algorithms experimentally by plotting learning curves. There is a lot of variance in practice, so learning curves are usually plotted as averages over multiple training runs.
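As an illustration, here is a minimal sketch of how such a comparison is usually plotted, assuming you have already logged per-episode returns from several independent runs (the file name and array shape are just placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder: an array of shape (n_runs, n_episodes) holding the return
# obtained in each episode of each independent training run.
returns_per_run = np.load("reinforce_returns.npy")

mean_curve = returns_per_run.mean(axis=0)   # average learning curve over runs
std_curve = returns_per_run.std(axis=0)     # spread between runs

episodes = np.arange(mean_curve.shape[0])
plt.plot(episodes, mean_curve, label="REINFORCE (mean over runs)")
plt.fill_between(episodes, mean_curve - std_curve, mean_curve + std_curve, alpha=0.2)
plt.xlabel("Episode")
plt.ylabel("Return")
plt.legend()
plt.show()
```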
As far as I know, there is no single clear winner in a general sense between different policy gradient methods. However, Actor-Critic and similar methods that combine TD learning with policy gradients are usually preferred over the basic REINFORCE approach, because the estimated value function allows "bootstrapped" updates to be made on every step. Although this adds some initial bias (the value function estimator's parameters initially bear no relation to the true value function), it reduces variance and allows more frequent updates, which often means faster convergence.
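To make the "bootstrapped update on every step" concrete, here is a minimal sketch of one-step Actor-Critic with linear function approximation and a softmax policy, in the spirit of the pseudocode in Sutton & Barto Chapter 13. The choice of CartPole, the learning rates and the use of raw observations as features are all just placeholder assumptions:

```python
import numpy as np
import gymnasium as gym  # any environment with a small discrete action space works

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

# Linear approximators: theta parameterises the softmax policy (actor),
# w parameterises the state-value estimate (critic).
theta = np.zeros((obs_dim, n_actions))
w = np.zeros(obs_dim)
alpha_theta, alpha_w, gamma = 1e-3, 1e-2, 0.99

def policy_probs(x):
    prefs = x @ theta
    prefs -= prefs.max()            # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

for episode in range(500):
    x, _ = env.reset()
    I = 1.0                         # accumulated discount for the actor update
    done = False
    while not done:
        probs = policy_probs(x)
        a = int(np.random.choice(n_actions, p=probs))
        x_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated

        # Bootstrapped one-step TD error - the key difference to REINFORCE,
        # which would wait for the full return at the end of the episode.
        v = x @ w
        v_next = 0.0 if terminated else x_next @ w
        delta = r + gamma * v_next - v

        # Critic update: semi-gradient TD(0).
        w += alpha_w * delta * x

        # Actor update: delta times the gradient of log pi(a | x, theta)
        # for a linear softmax policy.
        grad_log_pi = -np.outer(x, probs)
        grad_log_pi[:, a] += x
        theta += alpha_theta * I * delta * grad_log_pi

        I *= gamma
        x = x_next
```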
Is there any formulation that works for the mean reward of a rollout?
The base theory of policy gradients applies to episodic, continuing discounted, and continuing average-reward formulations, with minor changes in the proof.
The basic REINFORCE algorithm cannot be applied to a continuing (non-episodic) environment, because its updates are made only at the end of each episode. Actor-Critic approaches can be adapted to the average-reward formulation - for examples see this survey.
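Concretely, the usual adaptation (as in Sutton & Barto, Section 13.6) is to drop the discount factor and replace the discounted TD error in an update like the sketch above with the differential TD error, maintained alongside a running estimate $\bar R$ of the average reward:

$$\delta_t = R_{t+1} - \bar R + \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}), \qquad \bar R \leftarrow \bar R + \alpha_{\bar R}\, \delta_t,$$

with the actor and critic then updated using this $\delta_t$ exactly as before.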