After studying Q-learning, Sarsa and DQN, I've now come across the term "Policy Gradients".
It's a bit unclear to me how it differs from the approaches above. Here is my understanding; please correct it:
1. From the moment I first encountered DQN, I have always pictured its input vector as consisting only of the current state's features. On the output layer we get a vector of scores, one per action. We take the index of the highest-scoring action $a$ and execute it, which puts us into the next state $s'$.
2. To compute the error (how incorrectly we estimated the score of $a$), we feed $s'$ into the DQN and find its highest-scoring action $a'$, just as in step 1. Once again, it appears on the output layer of the DQN.
3. We compute the error by bootstrapping from $a'$, i.e. the target for $a$ is $r + \gamma \, Q(s', a')$. The "error" vector has zeros everywhere except at the index of the chosen action $a$ (a sketch of these three steps follows below).
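To make my understanding concrete, here is a minimal sketch of the update I have in mind, assuming a small PyTorch MLP; the sizes, reward and variable names are just placeholders, not anything canonical:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99   # made-up sizes

# Q-network: state features in, one score per action out.
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One (s, a, r, s') transition; random placeholders instead of a real environment.
s, s_next, r = torch.randn(state_dim), torch.randn(state_dim), torch.tensor(1.0)

# Step 1: forward pass on s, act greedily (ignoring exploration for brevity).
q_values = q_net(s)                       # vector of scores, one per action
a = int(torch.argmax(q_values))           # index of the highest-scoring action

# Step 2: forward pass on s', read off its highest-scoring action's value.
with torch.no_grad():
    target = r + gamma * q_net(s_next).max()

# Step 3: the error is non-zero only at index a; every other output is left
# untouched, which is the "zeros everywhere except the chosen action" part.
loss = (target - q_values[a]) ** 2
optimizer.zero_grad()
loss.backward()
optimizer.step()
```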
Was this in fact a "Policy Gradient", and not a value-function approach as I thought initially?
In that case, would the value-function approach be a DQN that takes the concatenation `[state_features; action_one_hot_encoded_vector]` as input and produces a single value on the output?
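If so, a sketch of that alternative might look roughly like this (again with made-up sizes; picking the greedy action then needs one forward pass per action):

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # made-up sizes

# Alternative shape: Q(s, a) takes [state_features; one_hot(a)] and returns
# a single scalar, so scoring every action requires n_actions forward passes.
q_net = nn.Sequential(nn.Linear(state_dim + n_actions, 32), nn.ReLU(), nn.Linear(32, 1))

s = torch.randn(state_dim)
scores = []
for a in range(n_actions):
    a_one_hot = torch.zeros(n_actions)
    a_one_hot[a] = 1.0
    scores.append(q_net(torch.cat([s, a_one_hot])).squeeze())
best_action = int(torch.argmax(torch.stack(scores)))
```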
I got this impression after reading this link.
Is the basic idea of a Policy Gradient really that simple, or am I getting things wrong?
Edit: there is a really awesome lecture about policy gradients. Unfortunately the video is deliberately unlisted, so people can't easily find it, but I am for free education, so here it is: CS294-112 9/6/17