What is the proof of this? Can someone point me to a reference?
It is called the "Policy Gradient Theorem", and a good reference is Sutton & Barto, Reinforcement Learning: An Introduction. In the second edition, the theory behind REINFORCE and Actor-Critic is covered in Chapter 13.
In brief, the proof shows that a particular sampled expression, built from the observed return and the gradient of the log of the policy with respect to its parameters, is (in expectation, up to a constant) the gradient of the policy's performance with respect to those parameters. It follows that a step in the direction of this sampled value is, in expectation, a step in the direction that increases the expected sum of discounted rewards (or the average expected reward) associated with that policy.
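For reference, in Sutton & Barto's notation the episodic form of the theorem can be written, up to a constant of proportionality, as

$$\nabla J(\theta) \;\propto\; \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta) \;=\; \mathbb{E}_\pi\!\left[ G_t \, \nabla \ln \pi(A_t \mid S_t, \theta) \right],$$

where $\mu$ is the on-policy state distribution, $q_\pi$ is the action-value function and $G_t$ is the return. The bracketed quantity on the right is exactly what REINFORCE samples and uses as its update direction.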
What is the best convergence rate out there that's known (if any are proven)?
This is not something that can be proven analytically, except in situations where you already know the loss function expressed directly in terms of the policy parameters (and nothing else that depends on them). Even for simple toy environments this is not really possible. The usual approach is to compare different algorithms experimentally by plotting learning curves. There is a lot of variance in practice, so learning curves are usually plotted as averages over multiple training runs.
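As an illustration, here is a minimal sketch of how such a comparison is usually plotted, assuming you have already logged per-episode returns from several independent runs (the file name and array shape are just placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder: an array of shape (n_runs, n_episodes) holding the return
# obtained in each episode of each independent training run.
returns_per_run = np.load("reinforce_returns.npy")

mean_curve = returns_per_run.mean(axis=0)   # average learning curve over runs
std_curve = returns_per_run.std(axis=0)     # spread between runs

episodes = np.arange(mean_curve.shape[0])
plt.plot(episodes, mean_curve, label="REINFORCE (mean over runs)")
plt.fill_between(episodes, mean_curve - std_curve, mean_curve + std_curve, alpha=0.2)
plt.xlabel("Episode")
plt.ylabel("Return")
plt.legend()
plt.show()
```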
As far as I know, there is no single clear winner in a general sense between different policy gradient methods. However, Actor-Critic and similar methods that combine TD learning with policy gradients are usually preferred over the basic REINFORCE approach, because the estimated value function allows "bootstrapped" updates to be made on every step. Although this adds some initial bias (the value function estimator's parameters initially bear no relation to the true value function), it reduces variance and allows more frequent updates, which often means faster convergence.
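To make the "bootstrapped update on every step" concrete, here is a minimal sketch of one-step Actor-Critic with linear function approximation and a softmax policy, in the spirit of the pseudocode in Sutton & Barto Chapter 13. The choice of CartPole, the learning rates and the use of raw observations as features are all just placeholder assumptions:

```python
import numpy as np
import gymnasium as gym  # any environment with a small discrete action space works

env = gym.make("CartPole-v1")
n_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

# Linear approximators: theta parameterises the softmax policy (actor),
# w parameterises the state-value estimate (critic).
theta = np.zeros((obs_dim, n_actions))
w = np.zeros(obs_dim)
alpha_theta, alpha_w, gamma = 1e-3, 1e-2, 0.99

def policy_probs(x):
    prefs = x @ theta
    prefs -= prefs.max()            # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

for episode in range(500):
    x, _ = env.reset()
    I = 1.0                         # accumulated discount for the actor update
    done = False
    while not done:
        probs = policy_probs(x)
        a = int(np.random.choice(n_actions, p=probs))
        x_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated

        # Bootstrapped one-step TD error - the key difference to REINFORCE,
        # which would wait for the full return at the end of the episode.
        v = x @ w
        v_next = 0.0 if terminated else x_next @ w
        delta = r + gamma * v_next - v

        # Critic update: semi-gradient TD(0).
        w += alpha_w * delta * x

        # Actor update: delta times the gradient of log pi(a | x, theta)
        # for a linear softmax policy.
        grad_log_pi = -np.outer(x, probs)
        grad_log_pi[:, a] += x
        theta += alpha_theta * I * delta * grad_log_pi

        I *= gamma
        x = x_next
```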
Is there any formulation that works for the mean reward of a rollout?
The base theory of policy gradients applies to episodic, continuing discounted, and continuing average-reward formulations, with minor changes in the proof.
The basic REINFORCE algorithm cannot be applied to a continuing (non-episodic) environment, because its updates are made only at the end of each episode. Actor-Critic approaches can be adapted to the average-reward formulation - for examples see this survey.
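Concretely, the usual adaptation (as in Sutton & Barto, Section 13.6) is to drop the discount factor and replace the discounted TD error in an update like the sketch above with the differential TD error, maintained alongside a running estimate $\bar R$ of the average reward:

$$\delta_t = R_{t+1} - \bar R + \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}), \qquad \bar R \leftarrow \bar R + \alpha_{\bar R}\, \delta_t,$$

with the actor and critic then updated using this $\delta_t$ exactly as before.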