After studying Q-learning, Sarsa and DQN, I've now come across the term "Policy Gradients".
It's a bit unclear to me how it differs from the approaches above. Here is my understanding; please correct it:
1. From the moment I first encountered DQN, I have always pictured its input vector as consisting only of the current state's features. On the output layer we get a vector of scores, one per action. We take the index of the highest-scoring action $a$ and execute it, which puts us into the next state $s'$.
2. To compute the error (how incorrectly we estimated the score of $a$), we feed $s'$ into the DQN and find its highest-scoring action $a'$, just as in step 1. Once again, it appears on the output layer of the DQN.
3. We compute the error by bootstrapping from $a'$, i.e. the target for $a$ is $r + \gamma \, Q(s', a')$. The "error" vector has zeros everywhere except at the index of the chosen action $a$ (a sketch of these three steps follows below).
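To make my understanding concrete, here is a minimal sketch of the update I have in mind, assuming a small PyTorch MLP; the sizes, reward and variable names are just placeholders, not anything canonical:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99   # made-up sizes

# Q-network: state features in, one score per action out.
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One (s, a, r, s') transition; random placeholders instead of a real environment.
s, s_next, r = torch.randn(state_dim), torch.randn(state_dim), torch.tensor(1.0)

# Step 1: forward pass on s, act greedily (ignoring exploration for brevity).
q_values = q_net(s)                       # vector of scores, one per action
a = int(torch.argmax(q_values))           # index of the highest-scoring action

# Step 2: forward pass on s', read off its highest-scoring action's value.
with torch.no_grad():
    target = r + gamma * q_net(s_next).max()

# Step 3: the error is non-zero only at index a; every other output is left
# untouched, which is the "zeros everywhere except the chosen action" part.
loss = (target - q_values[a]) ** 2
optimizer.zero_grad()
loss.backward()
optimizer.step()
```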
Was this in fact a "Policy Gradient", and not a value-function approach as I thought initially?
In that case, would the value-function approach be a DQN that takes the concatenation `[state_features; action_one_hot_encoded_vector]` as input and produces a single value on the output?
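If so, a sketch of that alternative might look roughly like this (again with made-up sizes; picking the greedy action then needs one forward pass per action):

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # made-up sizes

# Alternative shape: Q(s, a) takes [state_features; one_hot(a)] and returns
# a single scalar, so scoring every action requires n_actions forward passes.
q_net = nn.Sequential(nn.Linear(state_dim + n_actions, 32), nn.ReLU(), nn.Linear(32, 1))

s = torch.randn(state_dim)
scores = []
for a in range(n_actions):
    a_one_hot = torch.zeros(n_actions)
    a_one_hot[a] = 1.0
    scores.append(q_net(torch.cat([s, a_one_hot])).squeeze())
best_action = int(torch.argmax(torch.stack(scores)))
```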
I got this impression after reading this link.
Is the basic idea of a Policy Gradient really that simple, or am I getting things wrong?
Edit: there is a really awesome lecture about policy gradients. Unfortunately the video is deliberately unlisted, so people can't easily find it, but I am for free education, so here it is: CS294-112 9/6/17