6

The Adam optimizer is often used for training neural networks; it largely avoids the need for a hyperparameter search over settings such as the learning rate. Adam is an improvement on ordinary gradient descent.

I have a situation where I want to use projected gradient descent (see also here). Basically, instead of trying to minimize a function $f(x)$, I want to minimize $f(x)$ subject to the requirement that $x \ge 0$. Projected gradient descent works by clipping the value of $x$ after each iteration of gradient descent: after each step, every negative entry of $x$ is replaced with 0.
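For concreteness, here is a minimal sketch of what I mean, with a toy quadratic objective (and a made-up `grad_f`) standing in for my actual problem:

```python
import numpy as np

# Toy stand-in for my actual objective: f(x) = 0.5 * ||x - target||^2, whose
# unconstrained minimum has a negative entry, so the constraint x >= 0 is active.
target = np.array([1.0, -2.0, 0.5])

def grad_f(x):
    return x - target

x = np.zeros(3)
lr = 0.1                          # the kind of hyperparameter I would like to avoid tuning
for _ in range(200):
    x = x - lr * grad_f(x)        # ordinary gradient step
    x = np.maximum(x, 0.0)        # projection: replace negative entries with 0
print(x)                          # -> approximately [1.0, 0.0, 0.5]
```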

Unfortunately, projected gradient descent seems to interact poorly with the Adam optimizer. I'm guessing that Adam's exponential moving average of the gradients gets messed up by the clipping. And plain projected gradient descent has hyperparameters that need to be tuned.

Is there a version of Adam that can be used with projected gradient descent? I'm looking for a method that is an improvement on projected gradient descent, in the same way that Adam is an improvement on ordinary gradient descent (e.g., doesn't require hyperparameter tuning). Is there any such algorithm?

D.W.
  • I have the exact same question. There are some theoretical guarantees for projected gradient descent (vanilla version), and also for adaptive algorithms, e.g.: http://arxiv.org/abs/1904.09237 proves convergence of ADAM even with a projection. But I have not seen any examples of people discussing this in practice. Have you? – a06e Jun 17 '20 at 00:01
  • related: https://stats.stackexchange.com/questions/422969/solving-constrained-optimization-problems-with-adam – a06e May 06 '22 at 19:22

2 Answers

3

tl;dr: You should project the gradients before feeding them to Adam. You should also do the clipping afterwards in case of non-linear constraints and to avoid accumulation of numerical errors.

First, assuming normal SGD, for the specific case of (approximately) linear equality constraints, this can be done by

  1. projecting the gradient to the tangent space
  2. taking a step in that direction and
  3. projecting down again to the admissible set (only to correct for errors, numerical ones or due to non-linear constraints): $$ g_p = \pi_\top(g; x_n) \\ x_{n+1} = \pi_C(x_n - \alpha g_p) $$ where $g$ is the unprojected gradient, $\pi_\top(\,\cdot\,;x_n)$ projects to the tangent space at the current location $x_n$, $\pi_C(\,\cdot\,)$ projects to the admissible set, and $\alpha$ is the SGD learning rate.
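For illustration, here is a minimal sketch of these three steps for a single linear equality constraint $a^\top x = b$ with a toy quadratic objective (the names `project_tangent`, `project_C`, and `grad_f` are just placeholders of mine, not anything standard):

```python
import numpy as np

a, b = np.array([1.0, 1.0]), 1.0      # single linear equality constraint: a @ x = b

def project_tangent(g):
    # step 1: remove the component of g along the constraint normal a
    return g - (a @ g) / (a @ a) * a

def project_C(y):
    # step 3: orthogonal projection onto the affine set {x : a @ x = b}
    return y - (a @ y - b) / (a @ a) * a

def grad_f(x):                        # toy objective f(x) = 0.5 * ||x - c||^2
    return x - np.array([3.0, 0.0])

x = project_C(np.zeros(2))            # start on the constraint
alpha = 0.1
for _ in range(300):
    g_p = project_tangent(grad_f(x))  # projected gradient
    x = project_C(x - alpha * g_p)    # step 2, plus the correcting projection
print(x)                              # -> approximately [2.0, -1.0], the constrained minimum
```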

In the general case, you will need to (also see the comments below and this thread)

  1. identify the search direction by taking an $\epsilon$ step in the true (negative) gradient direction
  2. project the result down to the admissible set $C$
  3. rescale appropriately to get an estimate of the projected gradient
  4. take the actual update step in that direction
  5. project down again to the admissible set.

The full update (for normal SGD) therefore becomes: $$ \widehat{g}_p = {\textstyle\frac{1}{\epsilon}}(x_n - \pi_C(x_n - \epsilon g)) \qquad [\text{steps 1–3}]\\ x_{n+1} = \pi_C(x_n - \alpha \widehat{g}_p)~, \qquad [\text{steps 4 and 5}] $$ where $g$ is the original gradient, $0<\epsilon\ll1$ is a small constant, $\widehat{g}_p$ is the estimate of the projected gradient, $\alpha$ is the SGD learning rate, and $\pi_C$ projects to the admissible set $C$.
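As a concrete sketch, here is this general recipe for the $x \ge 0$ constraint from the question, where $\pi_C$ is simply elementwise clipping at zero (toy objective again, purely illustrative):

```python
import numpy as np

def project_C(y):                     # admissible set C = {x : x >= 0}
    return np.maximum(y, 0.0)

def projected_grad_estimate(x, g, eps=1e-6):
    # steps 1-3: epsilon step along the raw gradient, project, rescale
    return (x - project_C(x - eps * g)) / eps

def grad_f(x):                        # toy stand-in for the real loss gradient
    return x - np.array([1.0, -2.0, 0.5])

x = np.ones(3)
alpha = 0.1
for _ in range(300):
    g_hat = projected_grad_estimate(x, grad_f(x))
    x = project_C(x - alpha * g_hat)  # steps 4 and 5
print(x)                              # -> approximately [1.0, 0.0, 0.5]
```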

Note that taking a small step $\epsilon$ computes a numerical approximation of the tangent space, which will work in a broader range of cases (e.g. for non-differentiable constraints). If you can, it will always be better (more exact and numerically stable) to analytically project the gradient. But this will obviously depend on your specific problem.

For Adam, you would take the projected gradient $g_p$ or its estimate $\widehat{g}_p$ and feed it to Adam. Adam then does its thing (estimates the first/second moments, rescales/normalises the gradient) and performs a preliminary update step $$ \widetilde{x}_{n+1} = x_n - \alpha \widetilde{g}_p~, $$ where $\widetilde{g}_p$ is Adam's modified version of the projected gradient (based on the current and past gradients you fed it), and $\alpha$ is its learning rate. The last thing to do is the final projection: $$ x_{n+1} = \pi_C(\widetilde{x}_{n+1})~. $$
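Putting it all together for the $x \ge 0$ case, here is a self-contained sketch with a hand-rolled Adam update (toy objective and default Adam hyperparameters; this is just my reading of the recipe, not battle-tested code):

```python
import numpy as np

def project_C(y):                     # admissible set C = {x : x >= 0}
    return np.maximum(y, 0.0)

def grad_f(x):                        # toy stand-in for the real loss gradient
    return x - np.array([1.0, -2.0, 0.5])

x = np.ones(3)
alpha, beta1, beta2, eps_adam = 0.01, 0.9, 0.999, 1e-8   # usual Adam defaults
eps_proj = 1e-6                        # epsilon for the projected-gradient estimate
m, v = np.zeros_like(x), np.zeros_like(x)

for t in range(1, 3001):
    g = grad_f(x)
    # steps 1-3: feed Adam the projected gradient, not the raw one
    g_p = (x - project_C(x - eps_proj * g)) / eps_proj
    # standard Adam moment estimates and bias correction, computed from g_p
    m = beta1 * m + (1 - beta1) * g_p
    v = beta2 * v + (1 - beta2) * g_p ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps_adam)   # preliminary Adam step
    x = project_C(x)                                       # final projection
print(x)                              # -> approximately [1.0, 0.0, 0.5]
```

In a deep-learning framework you would do the same thing by overwriting the parameter gradients with their projected versions right before the optimizer step and clipping the parameters right after it.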

Background: The problem is not the exponential moving average but Adam's gradient normalisation based on the second moment. Essentially, the average gradient is rescaled by its average magnitude, which (for unconstrained problems) results in nice behaviour: if the magnitude is consistently small, gradients are scaled up to speed up convergence; if it is huge, they are scaled down to avoid overshooting; if gradients are noisy around zero (small average gradient but large average magnitude), they are also scaled down, allowing for precise convergence.

However, for projected gradients, this is a problem. The magnitude of the original gradient may be much larger than the magnitude of the projected gradient (especially close to an optimum on one of the constraints). If you do not project the gradients before feeding them to Adam, it will happily rescale everything with the magnitude of the original gradient, which basically means that things can never converge. However, if you feed it the projected gradient, it will "see" its small magnitude, rescale it appropriately, and things should be fine.

Edit: For linear equality constraints, it does not make a difference whether you first go into the unprojected gradient direction ($g$ below) and then project down, or you first project the gradient ($g_p$ below) and then take a step into that direction. For non-linear equality constraints, you would need to perform a second projection in the second case to correct for the errors due to the curvature.

For Adam, however, it makes a big difference because in the first case it sees a much larger gradient than in the second case. Adam tries to always make the effective gradient unit size, so it will scale down excessively in the first case.

[Figure: projected gradient]

  • Thank you for the answer! I want to check whether I understand. The standard projected gradient descent update rule (without Adam) is $x_{n+1} = \pi(x_n - \alpha \nabla f(x_n))$, where $\pi$ is the projection step. If I understand correctly, you are suggesting I compute $g = (x_n - \pi(x_n - \alpha \nabla f(x_n)))/\alpha$, then feed $g$ into Adam instead of $\nabla f(x_n)$? Is that right? What should I use as $\alpha$? (You refer to "projected gradient", but projected gradient descent does projection on the updated $x_n$, not on the gradient itself, hence why I'm asking.) – D.W. Apr 17 '21 at 21:23
  • I expanded my answer a bit... Yes, effectively I am suggesting exactly what you said. If the projection is very involved, you probably have to do it that way. Otherwise you can directly project the gradient down to the tangent space at $x_n$. The problems with the "standard" approach as well as this alternative are also mentioned in [this thread](https://math.stackexchange.com/questions/571068/what-is-the-difference-between-projected-gradient-descent-and-ordinary-gradient). With the adaptive gradients from Adam these problems are getting even worse for the reasons I explained. – user2700143 Apr 20 '21 at 21:50
  • Concerning the learning rate, for the projection as you wrote it, you basically want $\alpha$ as small as possible, which in the limit $\alpha\rightarrow0$ precisely corresponds to projecting to the tangent space. If you are asking about the learning rate of Adam, that very much depends on your loss function / optimisation landscape and how noisy your gradients are. It's hard to give general advice here. As a simple rule of thumb: As large as possible without the loss going crazy with first converging but then shooting up again :D – user2700143 Apr 20 '21 at 21:56
  • Projecting the gradient and then applying it is not equivalent to applying the gradient and then projecting the result. $x_{n+1} = \pi(x_n - \alpha \nabla f(x_n))$ is not the same as $x_{n+1} = x_n - \alpha \pi(\nabla f(x_n))$. Consider the case where $x_n$ is on one side of the constraint and $x_n - \alpha \nabla f(x_n)$ is on the other side of the constraint -- that case is not considered in your picture, and those two methods yield different results in that case. They are different even for linear constraints. – D.W. Apr 20 '21 at 22:15
  • Yes, that's correct, I should have been more precise: They are (approximately) equivalent for (approximately) linear *equality* constraints (or if $x_n$ lies on the constraint for other reasons). As I said, generally, you will need to do what you suggested, also because the tangent space may be ill-defined, e.g. for non-differentiable constraints. – user2700143 Apr 21 '21 at 08:19
  • I have extended my answer by concretely spelling out the updates, both for the special case of (approximately) linear equality constraints as well as for the general case. – user2700143 Apr 21 '21 at 09:12
  • Should "[steps 1-3]" be $-\hat{g}_p = \frac{1}{\epsilon} (x_n - \pi_C(x_n - \epsilon g))$? – D.W. Apr 21 '21 at 16:37
  • Should we have $\epsilon \approx \alpha$? Otherwise it seems like we can be in a situation where one of $x_n - \epsilon g$ and $x_n - \alpha g$ is in $C$ and the other is outside it, which distorts the gradient $\hat{g}_p$ fed to Adam. – D.W. Apr 21 '21 at 16:39
  • [steps 1-3] should be as I wrote. You are suggesting to put back in the sign error of your original suggestion :) To see this, just replace $\pi_C$ with the identity mapping. This is giving $-\widehat{g}_p=g$, which clearly has the wrong sign. – user2700143 Apr 21 '21 at 19:28
  • I wouldn't say that if $\epsilon \approx \alpha$ things will necessarily go terribly wrong, but it does not make a lot of sense to me. First, they fulfill different roles: $\epsilon$ numerically approximates a derivative, while $\alpha$ takes an actual step. Second, they are not comparable anyway, because Adam rescales the gradient and $\epsilon$ and $\alpha$ thus act on vectors of different magnitude *and* direction. I also do not understand what you mean by "distorting the gradients": $\widehat{g}_p$ does not depend on $\alpha$. – user2700143 Apr 21 '21 at 19:40
  • You should think of $\epsilon$ as the $d$ in $f^\prime(x) \approx \frac{f(x+d) - f(x)}{d}$ when taking the derivative. Ideally, you should take the limit $\epsilon \rightarrow 0$ analytically and if that's not possible you use $0<\epsilon\ll1$ for numerical approximation. – user2700143 Apr 21 '21 at 19:43
  • Btw if you have some crazy fractal permissible sets $C$, where infinitesimal steps can enter/leave $C$ with non-zero probability, you probably shouldn't be using gradient descent anyway :) – user2700143 Apr 21 '21 at 19:49
  • Sorry about the sign error, I meant: shouldn't "[steps 1-3]" be $\hat{g}_p = \frac{1}{\epsilon} (x_n - \pi_C(x_n - \epsilon g))$? You currently are missing the $x_n - \dots$ part. If we replace $\pi_C$ with the identity mapping with what you have currently written, [steps 1-3] become $\hat{g}_p = g - \frac{1}{\epsilon}x_n$, instead of $\hat{g}_p = g$. – D.W. Apr 21 '21 at 20:35
  • OK. Thanks for explaining the role of $\epsilon$. I understand. I wasn't talking about crazy sets $C$, I was imagining a box (say), where $x_n$ and $x_n - \epsilon g$ are in the box but $x_n - \alpha g$ are outside the box. But OK, I understand your point now about $\epsilon$ after more reflection -- sorry for being slow to catch on about that. – D.W. Apr 21 '21 at 20:38
  • Ah, yes, you are right. Sorry, I missed that. I updated my answer and double-checked, it should be correct now. – user2700143 Apr 22 '21 at 09:24
  • No worries :) There *are* very interesting and crazy applications, so I just wanted to be sure we are having the same basic situation in mind. Glad it makes sense now :) – user2700143 Apr 22 '21 at 09:53
  • I appreciate your careful analysis and your patience explaining it to me. This is great. Thank you for your answer! – D.W. Apr 22 '21 at 23:37
  • You are most welcome, happy I could provide a satisfying answer – 3 years after you posted the question :D Hope it helps with your application! – user2700143 Apr 23 '21 at 08:21
  • If one has non-linear equality constraints, is there a method for finding the projection operator $\pi_C$ into the admissible set? – a06e Sep 09 '21 at 15:41
  • Another question: Does there have to exist any relation between the two projectors $\pi_C$ and $\pi_T$? – a06e Sep 09 '21 at 16:16
  • For ADAM the modified gradient $\tilde g$ is generally not in the same direction as the original gradient. This is because ADAM scales different components of the gradient by different factors. Therefore the second projection $\pi_C$ is important because otherwise the resulting point will be outside the constraint, and not just by numerical error. – a06e Apr 05 '22 at 23:24
1

tl;dr: All gradient descent-based methods have an update rule similar to Adam's, so I think they'd all give you some grief. I don't know of any optimizers tailored to your application (but maybe you could invent one!)

Long answer:

According to this, Adam has four hyperparameters that typically don't require much tuning (although of course model hyperparameters will still need to be tuned, such as number of hidden units per layer, etc.)

  • alpha. Also referred to as the learning rate or step size. The proportion that weights are updated (e.g. 0.001). Larger values (e.g. 0.3) result in faster initial learning before the rate is updated; smaller values (e.g. 1.0E-5) slow learning right down during training.
  • beta1. The exponential decay rate for the first moment estimates (e.g. 0.9).
  • beta2. The exponential decay rate for the second-moment estimates (e.g. 0.999). This value should be set close to 1.0 on problems with a sparse gradient (e.g. NLP and computer vision problems).
  • epsilon. A very small number to prevent any division by zero in the implementation (e.g. 1.0E-8).

There are several other optimizers out there, such as AdaGrad and RMSprop. The idea in the 2015 Adam paper was that combining the strengths of AdaGrad and RMSprop gives you a better optimizer (Adam).

The algorithm for Adam is explained in the 2015 paper (Algorithm 1). The exponential smoothing portion applies specifically to the (biased) estimates of the first and second moment. The only time your actual parameter value plays a role is in the calculation of the gradient, and then in the update.

In the update step (last part of the while loop in Algorithm 1), you perform this calculation:

$$ x_t = x_{t-1} - \alpha \cdot \frac{\widehat{m}_t}{\sqrt{\widehat{v}_t} + \epsilon}~, $$ where $\widehat{m}_t$ and $\widehat{v}_t$ are the bias-corrected (unbiased) estimates of the first and second moments.

If you restrict x to be greater than or equal to 0, then once x_{t-1} hits zero (and the first-moment estimate stays positive), you can no longer update: the subtraction of alpha*(stuff) would result in something less than 0, which you then clip to 0, so you get 0 as the result in each subsequent update step.
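Here is a toy 1-D sketch of that pinning behaviour (my own illustration, not from the paper): the objective has its unconstrained minimum at x = -1, so the gradient stays positive on x >= 0 and the clipped Adam iterate ends up pinned at the boundary.

```python
import math

# Toy 1-D example: f(x) = 0.5 * (x + 1)^2, so the gradient x + 1 stays positive
# for every x >= 0 and the clipped iterate gets pinned at the boundary x = 0.
x = 0.5
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 5001):
    g = x + 1.0                                    # gradient of the toy objective
    m = beta1 * m + (1 - beta1) * g                # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2           # second-moment estimate
    step = alpha * (m / (1 - beta1 ** t)) / (math.sqrt(v / (1 - beta2 ** t)) + eps)
    x = max(x - step, 0.0)                         # Adam update, then clip at 0
print(x)   # 0.0: every proposed step is negative and is clipped back to 0
```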

Maybe that's the source of the issue? If so, I would recommend restricting your entire parameter space to be above zero. That is, in your initializer, make sure the initial values are all positive (i.e. don't use Glorot/Xavier initialization, because it is centred at zero and could give you initial values less than 0).

StatsSorceress
  • Essentially you're trying to do nonlinear optimization (based on a neural network) with linear constraints (x >= 0). So try looking for resources on "nonlinear optimization with linear constraints". – StatsSorceress Dec 09 '18 at 17:12