
In the paper "Learning to learn by gradient descent by gradient descent", the authors describe an RNN that learns to transform gradients into parameter updates, i.e. a learned optimizer.

The optimizer network directly interacts with the environment to take actions,

$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$ (Equation 1 from the paper)

and hence it feels like a reinforcement learning problem with a continuous action space. The formulation of the optimization objective, however, looks like what one would typically do in a supervised learning problem,

$L(\phi) = \mathbb{E}_f\left[f(\theta^*(f,\phi))\right]$ (Equation 2 from the paper)

Is this an indirect formulation of policy gradient?
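
To make the setup concrete, here is a minimal runnable sketch of the inner loop in Equation 1. The names (`ToyOptimizerNet`, `phi`, `hidden`) and the momentum-like recurrence are my own stand-ins, not the paper's code; the actual optimizer $g$ in the paper is a coordinatewise two-layer LSTM with parameters $\phi$. The only point here is that $g_t$ maps the optimizee's gradients to additive parameter updates, which is what makes the updates look action-like.

```python
import numpy as np

# Toy sketch of Equation 1 (hypothetical names, not the paper's code).
# In the paper, g is a coordinatewise two-layer LSTM with parameters phi;
# here a simple stateful stand-in makes the update loop visible.

def f(theta):
    """Optimizee objective: a simple quadratic, for illustration only."""
    return 0.5 * np.sum(theta ** 2)

def grad_f(theta):
    """Gradient of the quadratic optimizee."""
    return theta

class ToyOptimizerNet:
    """Stand-in for the learned optimizer g_t(grad, phi) with hidden state."""
    def __init__(self, phi=0.1):
        self.phi = phi          # stands in for the optimizer's parameters phi
        self.hidden = 0.0       # stands in for the LSTM hidden state

    def step(self, grad):
        # A made-up momentum-like recurrence plus a learned scale.
        self.hidden = 0.9 * self.hidden + grad
        return -self.phi * self.hidden   # g_t: the proposed parameter update

theta = np.array([1.0, -2.0, 3.0])
optimizer = ToyOptimizerNet(phi=0.1)

for t in range(20):
    g_t = optimizer.step(grad_f(theta))
    theta = theta + g_t                  # Equation 1: theta_{t+1} = theta_t + g_t
    print(t, f(theta))
```

Running this just drives the toy quadratic towards zero; the question is about how $\phi$ itself is trained.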

  • What I mean is, the optimizer is directly interacting with the environment, and the objective function is almost what we would call a reward function. So shouldn't it be considered reinforcement learning? And if so, using an LSTM would imply that we are not treating it as an MDP. But the paper neither calls its method Q-learning nor states the assumptions one would make when modeling a partially observable MDP. Am I getting it wrong? – Amey Agrawal Dec 02 '17 at 18:05
  • I don't know whether you are getting things right or wrong; mostly I am confused by your question. Could you please edit the text to avoid making statements about the paper which are not factually correct? Your question contains the text "The paper refers to this as supervised learning", but it does not. This might be an interesting question about the relationship of the work in the paper to reinforcement learning, but you should give more details from the paper (a very short summary) and why you suppose it is or isn't RL... – Neil Slater Dec 02 '17 at 18:25
  • Sorry for the inaccurate description. The paper never formally refers to the method as either supervised or reinforcement learning, but rather as *learning with gradient descent*. It does, however, refer to Daniel et al.'s work as reinforcement learning. I have updated the question to reference the equations based on which I assumed it to be supervised learning. Thank you. – Amey Agrawal Dec 02 '17 at 18:56
  • That's better. However, you give references to "Equation 1" and "Equation 2" - really we should copy them across to the question. I have done so, based on the numbering in the paper, but I'm not sure those equations support your statements - could you verify those are the ones you meant? – Neil Slater Dec 02 '17 at 19:25
  • Tried to make it more clear in the + – Amey Agrawal Dec 02 '17 at 19:43
  • Edit for the last comment: Tried to make it more clear in the last edit. While Equation 2 doesn't explicitly imply that it is formulated as supervised learning, in a typical reinforcement learning problem description we would have named $-f$ the reward function and $w_t$ the discount factor. Also, none of the typical reinforcement learning techniques like epsilon-greedy exploration or experience replay are used (see the sketch after these comments). – Amey Agrawal Dec 02 '17 at 19:50
  • OK, I think it is a better question after the edits. In fact I think it is a good and interesting question! Understanding the paper is beyond me for now (I'm working on understanding RL, but have a way to go before interpreting recent papers, or having enough clarity in my own mind to answer your question on this one). But I hope with the added details and focus in the question that someone will be able to answer it. – Neil Slater Dec 02 '17 at 19:55
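
Regarding the point in the comments about $w_t$ and $-f$: here is a sketch (same toy setup and hypothetical names as before) of the meta-objective in Equation 2 and its trajectory-weighted form with $w_t$. Because the unrolled updates are differentiable in $\phi$, $L(\phi)$ can be minimized by ordinary backpropagation through the unrolled optimization, which is what the paper does, rather than by a policy-gradient estimator that would treat $-f$ as a reward and $w_t$ as a discount factor.

```python
import numpy as np

# Sketch of the meta-objective L(phi) over an unrolled optimization run
# (hypothetical toy setup, reusing the quadratic optimizee and the
# momentum-like stand-in optimizer from the earlier sketch).

def f(theta):
    """Optimizee objective: a simple quadratic, for illustration only."""
    return 0.5 * np.sum(theta ** 2)

def unrolled_meta_loss(phi, theta0, T=20, w=None):
    """L(phi) = sum_t w_t * f(theta_t) along the unrolled optimization."""
    w = np.ones(T) if w is None else w      # the paper uses w_t = 1 for all t
    theta, hidden, loss = theta0.copy(), 0.0, 0.0
    for t in range(T):
        grad = theta                         # gradient of the quadratic
        hidden = 0.9 * hidden + grad         # toy optimizer recurrence
        theta = theta - phi * hidden         # Equation 1 with g_t = -phi * hidden
        loss += w[t] * f(theta)              # weighted sum of optimizee losses
    return loss

theta0 = np.array([1.0, -2.0, 3.0])
phi = 0.1

# Because the whole unroll is differentiable in phi, a meta-gradient exists
# directly (estimated here by finite differences instead of autodiff):
eps = 1e-5
meta_grad = (unrolled_meta_loss(phi + eps, theta0) -
             unrolled_meta_loss(phi - eps, theta0)) / (2 * eps)
print("L(phi) =", unrolled_meta_loss(phi, theta0), "dL/dphi ~", meta_grad)
```

The finite-difference estimate above is only to avoid pulling in an autodiff library; the paper computes the meta-gradient with truncated backpropagation through time over the unrolled updates.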

0 Answers