
What do warmup steps and warmup proportion mean? How do I select the number of warmup steps?

Does the learning rate change per batch or per epoch when the number of warmup steps is 1?

SS Varshini
    Does this answer your question? [In the context of Deep Learning, what is training warmup steps](https://datascience.stackexchange.com/questions/55991/in-the-context-of-deep-learning-what-is-training-warmup-steps) – Mr. Panda Oct 11 '21 at 10:55
  • I got it, but my question is: does a warmup step mean per batch, per layer, or per epoch? – SS Varshini Oct 12 '21 at 04:54

2 Answers


Answering your four questions:

  1. Warmup steps: a set of training steps at the beginning of training that use a very low learning rate.
  2. Warmup proportion ($wu$): the ratio of the number of warmup steps to the total number of training steps.
  3. Selecting the number of warmup steps varies from case to case:
    • This research paper experiments with warmup proportions of 0%, 2%, 4%, and 6%, all of which reflect significantly fewer warmup steps than in BERT.
    • This particular user reported better performance with 165k warmup steps; kindly refer to this forum.
  4. As per this deep-learning documentation, warmup is applied per epoch.
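A minimal sketch of how warmup proportion relates to warmup steps (the function and variable names are mine, not from the answer; the 10,000-step total is an assumed example):

```python
def warmup_steps_from_proportion(total_steps, warmup_proportion):
    """Warmup steps = warmup proportion (wu) * total training steps.

    round() guards against floating-point error in the product.
    """
    return round(total_steps * warmup_proportion)

# e.g. a 6% warmup proportion over 10,000 total steps,
# matching the upper end of the 0%-6% range the paper explores
print(warmup_steps_from_proportion(10_000, 0.06))  # → 600
```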



Archana David

I will quote from several resources that explain this well.

From Reddit:

a) Warm-up: A phase in the beginning of your neural network training where you start with a learning rate much smaller than your "initial" learning rate and then increase it over a few iterations or epochs until it reaches that "initial" learning rate.

Another nice explanation; this one also includes example code and a graph.

Warmup is a method of warming up the learning rate, mentioned in the ResNet paper. At the beginning of training, it uses a small learning rate to train for some epochs or steps (for example, 4 epochs or 10,000 steps), and then switches to the preset learning rate for the rest of training.
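The schedule described in that quote can be sketched as a simple linear warmup (the function name, base learning rate, and step counts below are illustrative assumptions, not values from the paper):

```python
def lr_at_step(step, warmup_steps=10_000, base_lr=0.001):
    """Linear warmup: ramp from near zero up to base_lr over warmup_steps,
    then train at the preset (base) learning rate."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(lr_at_step(0))       # tiny learning rate at the very first step
print(lr_at_step(20_000))  # → 0.001 (warmup finished, preset LR)
```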

Now, carefully read this one from Stack Overflow:

A training step is one gradient update. In one step batch_size examples are processed. An epoch consists of one full cycle through the training data. This is usually many steps. As an example, if you have 2,000 images and use a batch size of 10 an epoch consists of:

2,000 images / (10 images / step) = 200 steps.
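The arithmetic above can be checked directly; the epoch count here is an assumed example to show how steps accumulate across epochs:

```python
num_images = 2_000
batch_size = 10  # images processed per gradient update (one step)
steps_per_epoch = num_images // batch_size
print(steps_per_epoch)  # → 200

# An epoch is one full pass over the training data; over, say, 3 epochs:
epochs = 3
total_steps = steps_per_epoch * epochs
print(total_steps)  # → 600
```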

desertnaut
Mr. Panda
  • So the steps mean the number of batches or optimization steps, not the number of epochs? – CyberPlayerOne Jun 16 '23 at 13:28
  • @CyberPlayerOne that is correct. A training step refers to processing one batch of images in that example and updating the model's parameters based on the calculated gradients. – Mr. Panda Jun 17 '23 at 22:23
  • Thanks for answering. I'm still confused though. Because in PyTorch's doc, it states "`torch.optim.lr_scheduler` provides several methods to adjust the learning rate based on the number of **epochs**." (https://pytorch.org/docs/stable/optim.html). Maybe both training batches or epochs can work for LR scheduler to adjust LR in them? – CyberPlayerOne Jun 18 '23 at 06:53
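One way to reconcile the two readings: schedulers like `torch.optim.lr_scheduler.LambdaLR` apply a user-supplied multiplier to the base learning rate each time `scheduler.step()` is called, so if you call `step()` once per batch, the scheduler's counter effectively tracks batches rather than epochs. A plain-Python sketch of such a warmup multiplier (the 500-step warmup length is an assumed example):

```python
warmup_steps = 500  # assumed example value

def warmup_lambda(step):
    """LR multiplier: ramps linearly from 1/warmup_steps up to 1.0, then holds.

    The kind of function one might pass to torch.optim.lr_scheduler.LambdaLR;
    if scheduler.step() runs once per batch, `step` counts batches, not epochs.
    """
    return min(1.0, (step + 1) / warmup_steps)

print(warmup_lambda(0))    # → 0.002 (start of warmup)
print(warmup_lambda(499))  # → 1.0 (warmup complete)
```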