(Copied from my answer on stats.SE. (This meta.SE answer seems to approve of this copy-paste pattern.))
The original Adam paper briefly explains what it means by "invariant to diagonal rescaling of the gradients" at the end of section 2.1.
I will try to explain it in some more detail.
Like stochastic gradient descent (SGD), Adam is an iterative method that uses gradients in order to find a minimum of a function.
(By "gradients" I mean "the values of the gradient in different locations in parameter space". I later use "partial derivatives" in a similar fashion.)
But in contrast to SGD, Adam doesn't really use the gradient as a whole vector. Instead, it uses the partial derivative of each parameter independently.
(By "partial derivative of a parameter $x$" I mean "partial derivative of the cost function $C$ with respect to $x$", i.e. $\frac{\partial C}{\partial x}$.)
Let $\Delta^{(t)}$ be the step that Adam takes in parameter space in the $t^{\text{th}}$ iteration. Then the step it takes in the dimension of the $j^{\text{th}}$ parameter (in the $t^{\text{th}}$ iteration) is $\Delta^{(t)}_j$, which is given by:
$$\Delta^{(t)}_j=-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot {\hat m}^{(t)}_j$$
where:
- $\alpha$ is the learning rate hyperparameter.
- $\epsilon$ is a small hyperparameter to prevent division by zero.
- ${\hat m}^{(t)}_j$ is a (bias-corrected) exponential moving average of the partial derivatives of the $j^{\text{th}}$ parameter that were calculated in iterations $1$ to $t$.
- ${\hat v}^{(t)}_j$ is a (bias-corrected) exponential moving average of the squares of the partial derivatives of the $j^{\text{th}}$ parameter that were calculated in iterations $1$ to $t$.
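To make this concrete, here is a minimal NumPy sketch of one such update (the function name `adam_update` is mine; the hyperparameter defaults are the values suggested in the paper, and all operations are element-wise, i.e. each parameter is handled on its own):

```python
import numpy as np

def adam_update(g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration: g is the current gradient (the vector of partial
    derivatives), m and v are the moving averages from the previous iteration,
    and t is the iteration number (starting at 1). A sketch, not the reference
    implementation."""
    m = beta1 * m + (1 - beta1) * g           # moving average of the partial derivatives
    v = beta2 * v + (1 - beta2) * g ** 2      # moving average of their squares
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates, as in the paper
    v_hat = v / (1 - beta2 ** t)
    delta = -alpha / (np.sqrt(v_hat) + eps) * m_hat   # the step Delta^(t), element-wise
    return delta, m, v
```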
Now, what happens when we scale the partial derivative of the $j^{\text{th}}$ parameter by a positive factor $c$?
(The partial derivative of the $j^{\text{th}}$ parameter is just a function whose domain is the parameter space, so we can simply multiply its value by $c$ everywhere.)
- ${\hat m}^{(t)}_j$ becomes $c\cdot{\hat m}^{(t)}_j$
- ${\hat v}^{(t)}_j$ becomes $c^2\cdot{\hat v}^{(t)}_j$
- Thus (using the fact that $c>0$), we get that $\Delta^{(t)}_j$ becomes:
$$-\frac{\alpha}{\sqrt{c^2\cdot{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j=-\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j$$
And assuming $\epsilon$ is very small, we get:
$$\begin{gathered}
-\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j\approx -\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}}\cdot c\cdot{\hat m}^{(t)}_j=\\
-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}}\cdot{\hat m}^{(t)}_j\approx-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot{\hat m}^{(t)}_j
\end{gathered}$$
I.e. scaling the partial derivative of the $j^{\text{th}}$ parameter by a positive factor $c$ essentially doesn't affect $\Delta^{(t)}_j$.
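Here is a quick numerical check of this conclusion (a self-contained sketch; the sequence of partial derivatives, the factor $c$, and the helper name `delta_j` are arbitrary choices of mine, while the hyperparameter defaults are the paper's suggested values):

```python
import math

def delta_j(partials, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam's step in dimension j after seeing the given sequence of partial derivatives."""
    m = v = 0.0
    for g in partials:
        m = beta1 * m + (1 - beta1) * g        # moving average of the partial derivatives
        v = beta2 * v + (1 - beta2) * g * g    # moving average of their squares
    t = len(partials)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    return -alpha / (math.sqrt(v_hat) + eps) * m_hat

partials = [0.3, -0.1, 0.25]                   # arbitrary partial derivatives of parameter j
c = 1000.0                                     # arbitrary positive factor
print(delta_j(partials))                       # step without rescaling
print(delta_j([c * g for g in partials]))      # (almost) exactly the same step
```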
Finally, let $g=\begin{pmatrix}g_{1}\\ g_{2}\\ \vdots\end{pmatrix}$ be the gradient. Then $g_j$ is the partial derivative of the $j^{\text{th}}$ parameter.
What happens when we multiply the gradient by a diagonal matrix whose diagonal elements are all positive?
$$\begin{pmatrix}c_{1} & & \\ & c_{2} & \\ & & \ddots\end{pmatrix}g=\begin{pmatrix}c_{1} & & \\ & c_{2} & \\ & & \ddots\end{pmatrix}\begin{pmatrix}g_{1}\\ g_{2}\\ \vdots\end{pmatrix}=\begin{pmatrix}c_{1}\cdot g_{1}\\ c_{2}\cdot g_{2}\\ \vdots\end{pmatrix}$$
So it would only scale each partial derivative by a positive factor, but as we have seen above, this won't affect the steps that Adam takes.
In other words, Adam is invariant to multiplying the gradient by a diagonal matrix with only positive diagonal elements, which is what the paper means by "invariant to diagonal rescaling of the gradients".
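And here is the same check for the whole gradient vector (again just a sketch with arbitrary numbers and helper names of my own): apply a diagonal matrix with positive elements to every gradient in a sequence and compare the resulting Adam steps.

```python
import numpy as np

def adam_step(grad_seq, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam's step vector after seeing the given sequence of gradient vectors."""
    m = np.zeros_like(grad_seq[0])
    v = np.zeros_like(grad_seq[0])
    for g in grad_seq:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    t = len(grad_seq)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return -alpha / (np.sqrt(v_hat) + eps) * m_hat

grad_seq = [np.array([0.3, -2.0]), np.array([-0.1, 1.5]), np.array([0.25, 0.7])]
D = np.diag([1000.0, 0.01])                        # diagonal matrix with positive elements
print(adam_step(grad_seq))                         # step without rescaling
print(adam_step([D @ g for g in grad_seq]))        # (almost) exactly the same step
```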