(Copied from my answer on stats.SE. (This meta.SE answer seems to approve of this copy-paste pattern.))
The original Adam paper briefly explains what it means by "invariant to diagonal rescaling of the gradients" at the end of section 2.1.
I will try to explain it in some more detail.
Like stochastic gradient descent (SGD), Adam is an iterative method that uses gradients in order to find a minimum of a function.
(By "gradients" I mean "the values of the gradient in different locations in parameter space". I later use "partial derivatives" in a similar fashion.)
But in contrast to SGD, Adam doesn't really use the gradient as a whole vector. Instead, it uses the partial derivative of each parameter independently.
(By "partial derivative of a parameter $x$" I mean "partial derivative of the cost function $C$ with respect to $x$", i.e. $\frac{\partial C}{\partial x}$.)
Let $\Delta^{(t)}$ be the step that Adam takes in parameter space in the $t^{\text{th}}$ iteration. Then the step it takes in the dimension of the $j^{\text{th}}$ parameter (in the $t^{\text{th}}$ iteration) is $\Delta^{(t)}_j$, which is given by:
$$\Delta^{(t)}_j=-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot {\hat m}^{(t)}_j$$
where:
- $\alpha$ is the learning rate hyperparameter.
- $\epsilon$ is a small hyperparameter to prevent division by zero.
- ${\hat m}^{(t)}_j$ is a (bias-corrected) exponential moving average of the partial derivatives of the $j^{\text{th}}$ parameter that were calculated in iterations $1$ to $t$.
- ${\hat v}^{(t)}_j$ is a (bias-corrected) exponential moving average of the squares of the partial derivatives of the $j^{\text{th}}$ parameter that were calculated in iterations $1$ to $t$.
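To make this concrete, here is a minimal NumPy sketch of one such update (the function name `adam_update` is mine; the hyperparameter defaults are the values suggested in the paper, and all operations are element-wise, i.e. each parameter is handled on its own):

```python
import numpy as np

def adam_update(g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration: g is the current gradient (the vector of partial
    derivatives), m and v are the moving averages from the previous iteration,
    and t is the iteration number (starting at 1). A sketch, not the reference
    implementation."""
    m = beta1 * m + (1 - beta1) * g           # moving average of the partial derivatives
    v = beta2 * v + (1 - beta2) * g ** 2      # moving average of their squares
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates, as in the paper
    v_hat = v / (1 - beta2 ** t)
    delta = -alpha / (np.sqrt(v_hat) + eps) * m_hat   # the step Delta^(t), element-wise
    return delta, m, v
```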
Now, what happens when we scale the partial derivative of the $j^{\text{th}}$ parameter by a positive factor $c$?
(The partial derivative of the $j^{\text{th}}$ parameter is just a function whose domain is the parameter space, so we can simply multiply its value by $c$ everywhere.)
- ${\hat m}^{(t)}_j$ becomes $c\cdot{\hat m}^{(t)}_j$
- ${\hat v}^{(t)}_j$ becomes $c^2\cdot{\hat v}^{(t)}_j$
- Thus (using the fact that $c>0$), we get that $\Delta^{(t)}_j$ becomes:
$$-\frac{\alpha}{\sqrt{c^2\cdot{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j=-\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j$$
And assuming $\epsilon$ is very small, we get:
$$\begin{gathered}
-\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j\approx -\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}}\cdot c\cdot{\hat m}^{(t)}_j=\\
-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}}\cdot{\hat m}^{(t)}_j\approx-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot{\hat m}^{(t)}_j
\end{gathered}$$
I.e. scaling the partial derivative of the $j^{\text{th}}$ parameter by a positive factor $c$ essentially doesn't affect $\Delta^{(t)}_j$.
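Here is a quick numerical check of this conclusion (a self-contained sketch; the sequence of partial derivatives, the factor $c$, and the helper name `delta_j` are arbitrary choices of mine, while the hyperparameter defaults are the paper's suggested values):

```python
import math

def delta_j(partials, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam's step in dimension j after seeing the given sequence of partial derivatives."""
    m = v = 0.0
    for g in partials:
        m = beta1 * m + (1 - beta1) * g        # moving average of the partial derivatives
        v = beta2 * v + (1 - beta2) * g * g    # moving average of their squares
    t = len(partials)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    return -alpha / (math.sqrt(v_hat) + eps) * m_hat

partials = [0.3, -0.1, 0.25]                   # arbitrary partial derivatives of parameter j
c = 1000.0                                     # arbitrary positive factor
print(delta_j(partials))                       # step without rescaling
print(delta_j([c * g for g in partials]))      # (almost) exactly the same step
```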
Finally, let $g=\begin{pmatrix}g_{1}\\ g_{2}\\ \vdots\end{pmatrix}$ be the gradient. Then $g_j$ is the partial derivative of the $j^{\text{th}}$ parameter.
What happens when we multiply the gradient by a diagonal matrix whose diagonal elements are all positive?
$$\begin{pmatrix}c_{1} & & \\ & c_{2} & \\ & & \ddots\end{pmatrix}g=\begin{pmatrix}c_{1} & & \\ & c_{2} & \\ & & \ddots\end{pmatrix}\begin{pmatrix}g_{1}\\ g_{2}\\ \vdots\end{pmatrix}=\begin{pmatrix}c_{1}\cdot g_{1}\\ c_{2}\cdot g_{2}\\ \vdots\end{pmatrix}$$
So it would only scale each partial derivative by a positive factor, but as we have seen above, this won't affect the steps that Adam takes.
In other words, Adam is invariant to multiplying the gradient by a diagonal matrix with only positive diagonal elements, which is what the paper means by "invariant to diagonal rescaling of the gradients".
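And here is the same check for the whole gradient vector (again just a sketch with arbitrary numbers and helper names of my own): apply a diagonal matrix with positive elements to every gradient in a sequence and compare the resulting Adam steps.

```python
import numpy as np

def adam_step(grad_seq, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam's step vector after seeing the given sequence of gradient vectors."""
    m = np.zeros_like(grad_seq[0])
    v = np.zeros_like(grad_seq[0])
    for g in grad_seq:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    t = len(grad_seq)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return -alpha / (np.sqrt(v_hat) + eps) * m_hat

grad_seq = [np.array([0.3, -2.0]), np.array([-0.1, 1.5]), np.array([0.25, 0.7])]
D = np.diag([1000.0, 0.01])                        # diagonal matrix with positive elements
print(adam_step(grad_seq))                         # step without rescaling
print(adam_step([D @ g for g in grad_seq]))        # (almost) exactly the same step
```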