Suppose you have a strictly convex function $f(x)$ that you'd like to minimize. To do so with gradient descent, you keep applying $$x_{i+1} = x_{i}-\lambda\frac{\partial f}{\partial x}$$ until convergence; that is, until $x_i$ is barely changing or not changing at all, because that implies ${\partial f}/{\partial x}$ is zero or very close to zero in that neighborhood, which in turn implies that you've reached the minimum. The same applies if $f$ is a function of many variables: the gradient descent rule is applied to each of them.
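As a minimal sketch of that loop (the function $f(x)=(x-3)^2$, the learning rate, and the tolerance below are arbitrary choices for illustration, not anything from your question):

```python
def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iters=10_000):
    """Minimize a function given its derivative `grad`, starting from x0."""
    x = x0
    for _ in range(max_iters):
        x_new = x - lr * grad(x)      # x_{i+1} = x_i - lambda * df/dx
        if abs(x_new - x) < tol:      # x barely changes => df/dx ~ 0 => minimum
            return x_new
        x = x_new
    return x

# f(x) = (x - 3)**2 is strictly convex and df/dx = 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # ~3.0
```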
Now in data science $f$ can be a function of many variables that also involves a sum, for instance $$f(\theta_1,\theta_0)=\sum_{i=1}^m\left(y_i-(\theta_1 x_i+\theta_0)\right)^2$$ where $x_i$ and $y_i$ are drawn from some dataset of length $m$.
In that case ${\partial f}/{\partial \theta_1}$ and ${\partial f}/{\partial \theta_0}$ are also going to involve the sum from $i=1$ to $i=m$; that is, to do a single update step you need to load the entire dataset into memory, because you need it to compute the derivatives. An alternative formulation, which can be shown to be faster while also avoiding this issue (loading the entire dataset can be infeasible), uses only a subset of the dataset for each step; that subset can even be, as you said, just one example from the dataset. This variant is known as Stochastic (or Mini-Batch) Gradient Descent.
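A rough sketch of what the two flavours look like for the loss above (the synthetic data, learning rate, and iteration count here are arbitrary choices for illustration):

```python
import numpy as np

def batch_gradient_step(theta1, theta0, x, y, lr):
    """One update using the whole dataset: the derivatives keep the sum over i = 1..m."""
    residual = y - (theta1 * x + theta0)
    d_theta1 = -2 * np.sum(residual * x)  # df/d(theta1)
    d_theta0 = -2 * np.sum(residual)      # df/d(theta0)
    return theta1 - lr * d_theta1, theta0 - lr * d_theta0

def stochastic_gradient_step(theta1, theta0, xi, yi, lr):
    """One update using a single example (x_i, y_i) instead of the whole sum."""
    residual = yi - (theta1 * xi + theta0)
    return theta1 + lr * 2 * residual * xi, theta0 + lr * 2 * residual

# Illustrative data generated from y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)

theta1, theta0 = 0.0, 0.0
for _ in range(1000):
    theta1, theta0 = batch_gradient_step(theta1, theta0, x, y, lr=0.005)
print(theta1, theta0)  # should approach 2 and 1

# The stochastic variant would instead visit one example at a time, e.g.:
#   for xi, yi in zip(x, y):
#       theta1, theta0 = stochastic_gradient_step(theta1, theta0, xi, yi, lr=0.005)
```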
So to answer your questions:
1 - You can use "Gradient Descent" in its entirety by considering the whole dataset for each iteration.
2 - You can always derive the update rule yourself by differentiating with respect to each of the parameters. If you see sums over the whole dataset, leave them there so you can use Gradient Descent in its entirety (a worked derivation for the loss above is sketched after this list).
3 - Once you compute the partial derivatives, you plug them into the iterative scheme and that's when the parameters get updated. Again, to compute the partial derivatives you might need to consider the whole dataset if you're using Gradient Descent in its entirety, also known as Batch Gradient Descent.
4 - You stop updating the weights whenever you believe that the loss function has reached the minimum. But because this might sometimes cause you to overfit the data if you have many parameters, you might instead stop whenever your model has reasonable accuracy on the validation set. I suggest that you read about early stopping (a minimal sketch of it follows after this list).
5 - I can't see how this is "the elephant in the room" given that it isn't so relevant to the rest of the questions; however, like other iterative schemes used in optimization, you start with random values for your parameters and the gradient should lead you to the minimum. Regardless, in some scenarios there do exist methods that help you start with better random guesses.
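Regarding 2, here is the worked derivation for the loss above (just the chain rule; nothing here is specific to your dataset):

$$\frac{\partial f}{\partial \theta_1}=\sum_{i=1}^m 2\left(y_i-(\theta_1 x_i+\theta_0)\right)(-x_i)=-2\sum_{i=1}^m\left(y_i-(\theta_1 x_i+\theta_0)\right)x_i$$

$$\frac{\partial f}{\partial \theta_0}=-2\sum_{i=1}^m\left(y_i-(\theta_1 x_i+\theta_0)\right)$$

so the update rules are

$$\theta_1 \leftarrow \theta_1-\lambda\frac{\partial f}{\partial \theta_1},\qquad \theta_0 \leftarrow \theta_0-\lambda\frac{\partial f}{\partial \theta_0}.$$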
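Regarding 4, a minimal sketch of early stopping with a patience counter; `train_one_epoch` and `validation_loss` are placeholders for whatever your model provides, and the patience value is arbitrary:

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop once the validation loss hasn't improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                   # one pass of (batch/stochastic) GD
        current = validation_loss()
        if current < best_loss:
            best_loss = current
            epochs_without_improvement = 0  # validation improved, keep going
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # stop before overfitting sets in
    return best_loss
```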