
Suppose we have a learning rate $\alpha_n$ for the $n^{\text{th}}$ step of the gradient descent process. What would be the impact of using a constant value for $\alpha_n$ in gradient descent?

Umbrage

2 Answers


Intuitively, if $\alpha$ is too large you may "overshoot" the minimum and end up bouncing around the search space without converging. If $\alpha$ is too small, convergence will be slow and you could end up stuck on a plateau or in a local minimum.

That's why most learning rate schedules start with a somewhat larger learning rate for quick early gains and then reduce it gradually.
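
As a minimal sketch of that behavior (my own illustration, not code from the answer, assuming the toy one-dimensional cost $J(\theta) = \theta^2$), the snippet below compares a constant learning rate with a simple decay schedule $\alpha_n = \alpha_0 / (1 + \text{decay} \cdot n)$:

```python
def gradient_descent(alpha0, steps=50, decay=0.0, theta=5.0):
    """Minimize J(theta) = theta**2 by gradient descent and return the final theta."""
    for n in range(steps):
        alpha_n = alpha0 / (1.0 + decay * n)   # constant when decay == 0
        grad = 2.0 * theta                     # dJ/dtheta
        theta = theta - alpha_n * grad
    return theta

# Too-large constant rate: the iterate overshoots, oscillates, and diverges.
print(gradient_descent(alpha0=1.1))
# Small constant rate: converges, but slowly.
print(gradient_descent(alpha0=0.01))
# Larger starting rate with decay: fast early progress, stable later steps.
print(gradient_descent(alpha0=0.9, decay=0.5))
```

The three runs show exactly the trade-off above: divergence, slow convergence, and a decayed schedule that gets the quick early gains without the instability.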

oW_

Gradient descent has the following rule:

$\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$

Here $\theta_{j}$ is a parameter of your model and $J$ is the cost/loss function. As the iterates approach a minimum, the gradient $\frac{\partial}{\partial \theta_{j}} J(\theta)$ shrinks toward 0, so the update $\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$ gets smaller at each step even when $\alpha$ is fixed. $\alpha$ can be constant, and in many cases it is, but varying $\alpha$ might help you converge faster.
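
To illustrate (a sketch on hypothetical toy least-squares data, not code from the answer), the loop below applies this rule with a constant $\alpha$ and prints the update norm, which shrinks as the gradient itself approaches 0:

```python
import numpy as np

# Toy data generated from y = 1 + 2*x (hypothetical example for illustration only).
X = np.c_[np.ones(20), np.linspace(0.0, 1.0, 20)]   # design matrix with columns [1, x]
y = X @ np.array([1.0, 2.0])                        # targets with true theta = [1, 2]

theta = np.zeros(2)   # parameters theta_0 (intercept) and theta_1 (slope)
alpha = 0.5           # constant learning rate

for step in range(200):
    grad = X.T @ (X @ theta - y) / len(y)   # dJ/dtheta_j for J = mean squared error / 2
    theta = theta - alpha * grad            # simultaneous update of every theta_j
    if step % 50 == 0:
        # Even though alpha is fixed, the step alpha * grad shrinks as grad -> 0.
        print(step, np.linalg.norm(alpha * grad))

print(theta)   # close to [1.0, 2.0]
```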

Wes