Questions tagged [learning-rate]

43 questions
8 votes, 1 answer

When should you use learning rate scheduling over an adaptive learning rate optimization algorithm?

To converge to the optimum properly, various algorithms with adaptive learning rates have been invented, such as AdaGrad, Adam, and RMSProp. On the other hand, there are learning rate schedules such as power scheduling and…
Blaszard
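
A minimal sketch of the two options being contrasted, assuming TensorFlow/Keras (all values are placeholders): a power (inverse-time) schedule attached to plain SGD, next to the adaptive Adam optimizer.

```python
import tensorflow as tf

# Power (inverse-time) scheduling: lr(t) = lr0 / (1 + decay_rate * t / decay_steps).
power_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.1, decay_steps=1000, decay_rate=1.0)
sgd_scheduled = tf.keras.optimizers.SGD(learning_rate=power_schedule)

# Adaptive alternative: Adam keeps a single base rate but scales each parameter's
# step by running statistics of its gradients.
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)
```
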
7 votes, 1 answer

Is it a good practice to always apply `ReduceLROnPlateau()`, given that models benefit from reducing the learning rate once learning stagnates?

The rationale behind the Keras function ReduceLROnPlateau() is that models benefit from reducing the learning rate once learning stagnates. Is it good practice to always apply ReduceLROnPlateau()? What are some situations, if any, in which not to apply…
user781486
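
A minimal sketch of the callback in question, assuming TensorFlow/Keras; the monitor, factor, patience, and floor are placeholder values, not recommendations.

```python
import tensorflow as tf

# Halve the learning rate whenever val_loss has not improved for 5 epochs,
# but never drop below 1e-6.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5, min_lr=1e-6)

# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[reduce_lr])
# (model and data are not defined here; the callback only acts through fit().)
```
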
4 votes, 1 answer

Need to kickstart learning rates

I was just looking at the PyTorch docs for the different available schedulers and found one that I am having some trouble understanding. The others make sense: as training progresses, the learning rate gradually decreases. But in my…
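
The excerpt does not say which scheduler is meant, but one PyTorch scheduler that deliberately raises the learning rate again is CyclicLR; a hedged sketch with placeholder values:

```python
import torch

# CyclicLR bounces the rate between base_lr and max_lr instead of letting it
# decay monotonically, "kickstarting" it after each decay phase.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2, step_size_up=200)
# Unlike epoch-based schedulers, CyclicLR is stepped after every batch.
```
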
4 votes, 1 answer

Should a Learning Rate Scheduler adjust the learning rate by optimization step (batch) or by epoch?

The PyTorch docs suggest that torch.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs. However, other sources suggest that the learning rate should be adjusted at every optimization step…
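
A minimal sketch of the convention, using a toy model and dataset: epoch-based schedulers such as StepLR are stepped once per epoch, after the inner batch loop, while batch-based schedulers such as OneCycleLR are stepped after every optimizer update.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
loader = DataLoader(data, batch_size=16)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # once per epoch for an epoch-based schedule
```
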
3 votes, 3 answers

Why is the sign of the gradient (plus or minus) not enough for finding the steepest ascent?

Consider a simple 1-D function $y = x^2$ whose maximum we want to find with the gradient ascent method. If we start at the point $x = 3$: $$ \frac{\partial f}{\partial x} \biggr\rvert_{x=3} = 2x \biggr\rvert_{x=3} = 6 $$ This means that a direction in which…
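
A toy illustration of the point at issue: the ascent step uses the gradient's value, not just its sign, so the step length scales with how steep the function is at the current point.

```python
# Ascent step on f(x) = x^2 starting at x = 3, as in the question.
lr = 0.1
x = 3.0
grad = 2 * x            # df/dx = 2x, so 6 at x = 3
x_next = x + lr * grad  # ascent step: 3 + 0.1 * 6 = 3.6
print(x_next)
```
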
3 votes, 1 answer

Which learning rate should I choose?

I'm training a segmentation model, Unet++, on 2D images and am now trying to find the optimal learning rate. The backbone of the model is Resnet34, I use the Adam optimizer, and the loss function is the Dice loss. Also, I use a few…
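
One hedged way to approach this is a short sweep over a log-spaced grid of candidate rates, keeping the value whose training loss drops fastest without diverging; the toy regression model below stands in for the Unet++/Resnet34 setup described in the question.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))

for lr in [1e-5, 1e-4, 1e-3, 1e-2]:
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for x, y in DataLoader(data, batch_size=32):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"lr={lr:.0e}  last batch loss={loss.item():.4f}")
```
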
3 votes, 2 answers

Scikit-learn linear regression - learning rate and epoch adjustment

I am trying to learn linear regression using ordinary least squares and gradient descent from scratch. I read the documentation for the scikit-learn function and do not see a way to adjust the learning rate or the number of epochs with the…
chrisper
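
For context, scikit-learn's LinearRegression solves ordinary least squares in closed form and so exposes no learning rate or epoch count; SGDRegressor is the estimator that fits the same model by gradient descent and does expose them. A minimal sketch:

```python
from sklearn.linear_model import LinearRegression, SGDRegressor

ols = LinearRegression()  # closed-form OLS: nothing to tune for the optimizer
sgd = SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=1000)
```
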
3 votes, 1 answer

Learning rate scheduler

A very important aspect of deep learning is the learning rate. Can someone tell me how to initialize the lr and how to choose the decay rate? I'm sure there are valuable pointers that some experienced people in the community can share with…
user
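
There is no universal answer, but a common hedged starting point is to pick an initial rate that trains stably and decay it exponentially; all numbers below are assumptions to tune, not recommendations (TensorFlow/Keras assumed).

```python
import tensorflow as tf

# Decay the rate by 4% every 1000 steps, starting from 1e-3.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.96)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```
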
2 votes, 1 answer

Intuition behind Adagrad optimization

The paper ADADELTA: An Adaptive Learning Rate Method describes a method called Adagrad, which has the following update rule: $$ X_{n+1} = X_n - \frac{\mathrm{lr}}{\sqrt{\sum_{i=0}^{n} g_i^2}}\, g_n $$ Now I understand that this update rule dynamically…
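
A bare-bones NumPy sketch of the quoted update rule, with a small epsilon added for numerical stability: the base rate is divided by the root of the accumulated squared gradients, so frequently updated coordinates get smaller steps over time.

```python
import numpy as np

def adagrad_step(x, grad, grad_sq_sum, lr=0.1, eps=1e-8):
    grad_sq_sum = grad_sq_sum + grad ** 2
    x = x - lr / (np.sqrt(grad_sq_sum) + eps) * grad
    return x, grad_sq_sum

x = np.array([3.0])
acc = np.zeros_like(x)
for _ in range(5):
    g = 2 * x               # gradient of f(x) = x^2
    x, acc = adagrad_step(x, g, acc)
print(x)
```
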
2 votes, 2 answers

Constant Learning Rate for Gradient Descent

Suppose we have a learning rate $\alpha_n$ for the $n^{\text{th}}$ step of the gradient descent process. What would be the impact of using a constant value for $\alpha_n$ in gradient descent?
Umbrage
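
A toy 1-D illustration of what a constant step size does on $f(x) = x^2$: the iterates satisfy $x_{n+1} = (1 - 2\alpha)x_n$, so a small constant $\alpha$ converges while a large one overshoots and diverges.

```python
def run(alpha, steps=10, x=3.0):
    for _ in range(steps):
        x = x - alpha * 2 * x   # gradient of x^2 is 2x
    return x

print(run(0.1))   # shrinks toward 0
print(run(1.1))   # grows without bound
```
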
2 votes, 1 answer

Is it necessary to tune the step size when using Adam?

The Adam optimizer has four main hyperparameters. For example, looking at the Keras interface, we have `keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)`. The first hyperparameter is called the step size…
2 votes, 1 answer

Constant validation loss & accuracy, training accuracy fluctuates

I am training a SqueezeNet model for binary classification of images. I have 79,968 images for training (a 50:50 split between the two classes) and 8,892 images in the validation set. After 35,000 iterations my training accuracy fluctuates between 1 and 0.96875. The…
2 votes, 1 answer

Why are optimization algorithms slower at critical points?

I just found the animation below from Alec Radford's presentation. As can be seen, all algorithms slow down considerably at the saddle point (where the derivative is 0) and speed up once they get out of it. Regular SGD simply gets stuck at the…
ShellRox
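
A small numerical illustration of why progress stalls there: near a saddle point the gradient is close to zero, so a fixed-learning-rate step is correspondingly tiny, and steps grow again once the gradient does.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

lr = 0.1
near_saddle = np.array([1e-3, 1e-4])
far_away = np.array([1.0, 0.5])
print(np.linalg.norm(lr * grad(near_saddle)))  # ~2e-4: an almost negligible step
print(np.linalg.norm(lr * grad(far_away)))     # ~0.22: a much larger step
```
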
1 vote, 1 answer

Is there a relationship between learning rate and training set size?

I have a large dataset for training a neural network model. However, I don't have enough resources to do proper hyperparameter tuning on the whole dataset. Therefore, my idea is to tune the learning rate on a subset of the data (let's say…
jakes
1 vote, 0 answers

Why does the learning rate influence whether I get an error from BCE or not?

When I use a learning rate higher than 0.001, I get this: Assertion `input_val >= zero && input_val <= one` failed. This means that the input I gave to BCE is above 1 or below 0, right? Why does changing the learning rate cause this error? Also, I…
SamuelS
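
For context: PyTorch's nn.BCELoss asserts its inputs lie in [0, 1], and a larger learning rate can push the network's outputs out of that range (for example through a missing or overflowing sigmoid). A hedged sketch of the usual workaround, BCEWithLogitsLoss, which takes raw logits and applies the sigmoid internally:

```python
import torch
import torch.nn as nn

logits = torch.tensor([3.2, -5.0, 0.7])   # raw outputs, not restricted to [0, 1]
targets = torch.tensor([1.0, 0.0, 1.0])
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(loss)
```
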