I was just looking at the PyTorch docs on the different available learning-rate schedulers, and I found one here that I'm having some trouble understanding.
The others make sense to me: as training progresses, the learning rate gradually decreases. But in my study so far, I have yet to come across a model that needs such a "kickstart" mechanism.
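To show the kind of behaviour I mean, here is a minimal sketch that just prints the learning rate each epoch (I'm assuming a warmup-style scheduler such as `LinearLR` with `start_factor` below 1 for illustration; I'm not certain this is exactly the one from the docs):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import LinearLR

# Toy model and optimizer just so the scheduler has something to attach to.
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Starts at 10% of the base LR and ramps linearly up to 100% over 5 steps.
scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=5)

for epoch in range(8):
    # ... training step would go here ...
    print(epoch, scheduler.get_last_lr())
    optimizer.step()      # step the optimizer before the scheduler
    scheduler.step()
```

The printed learning rate starts at 0.01 and climbs to the base rate of 0.1 over the first five epochs, rather than decaying like the other schedulers do.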
Could someone please help me figure out why we need this?
