> My cost/loss function drops drastically and approaches 0
As you said, you didn't use any optimizer, so technically it's not possible for the cost/loss function to drop drastically and approach zero. It is the optimizer that makes the model work toward the objective of reducing the cost/error, or in simpler terms, in the gradient-descent hill analogy, the optimizer finds which way of descending the hill accounts for the greatest reduction in error. Without one, your model just stays at the top of the hill forever. The loss is just a number to your model.

Since there is no optimizer in your code, it's technically not possible that the "cost/loss function drops drastically and approaches 0". Your model's loss stays at point B forever.
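
Since the original code isn't shown, here is a minimal NumPy sketch of that situation, with random MNIST-shaped arrays standing in for the real data (all names here are hypothetical): the loss gets recomputed every epoch, but with no optimizer nothing ever changes it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a dataset: 100 samples, 784 features, 10 classes
# (MNIST-like shapes; random data keeps the sketch self-contained).
X = rng.normal(size=(100, 784))
y = rng.integers(0, 10, size=100)

# Randomly initialised weights and bias of a single softmax layer.
W = rng.normal(scale=0.01, size=(784, 10))
b = np.zeros(10)

def cross_entropy_loss(X, y, W, b):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(y)), y]).mean()

# "Training" loop with no optimizer: nothing ever updates W or b, so
# the loss is recomputed but never moves from its starting value
# (about ln(10) ~= 2.30 for 10 balanced classes).
for epoch in range(5):
    print(f"epoch {epoch}: loss = {cross_entropy_loss(X, y, W, b):.4f}")
```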
> But the weights are still changing in a visible way, a lot faster than the cost function

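The update equations were originally shown here as an image. For a weight matrix $W$, a bias $b$, a loss $L$, and a learning rate $\eta$, they are presumably the standard gradient-descent rules:

$$W \leftarrow W - \eta\,\frac{\partial L}{\partial W}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}$$
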
Because your model predicts essentially at random, in every batch a few points happen to land in the correct class by chance. That accounts for a very small reduction in the loss, and this change is applied to the weights through the update equations above, which is why you see small random changes in the weights at every batch. The overall effect of these changes is negligible.
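
To make those equations concrete, here is a hedged sketch of a single manual gradient-descent step for the softmax layer from the snippet above (`sgd_step` and `lr` are hypothetical names; the gradient is the standard one for softmax with cross-entropy):

```python
import numpy as np

def sgd_step(X_batch, y_batch, W, b, lr=0.01):
    """One manual update: W <- W - lr*dL/dW, b <- b - lr*dL/db."""
    # Forward pass: softmax probabilities for this batch.
    logits = X_batch @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    # Gradient of the mean cross-entropy loss w.r.t. the logits.
    grad_logits = probs
    grad_logits[np.arange(len(y_batch)), y_batch] -= 1.0
    grad_logits /= len(y_batch)

    # Apply the update equations in place.
    W -= lr * (X_batch.T @ grad_logits)
    b -= lr * grad_logits.sum(axis=0)
```

Calling this once per batch is what would produce visible batch-to-batch weight changes; never calling it leaves the weights frozen.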
I also put together a real example on MNIST data, computed without an optimizer, and the results are as follows:

Here you can clearly see the red line (the loss) stays at the top of the graph forever. I used a batch size of 5 and ran it for 5 epochs.
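
The plot itself isn't reproduced here, but a sketch of that run (reusing the arrays and `cross_entropy_loss` from the first snippet, with batch size 5, 5 epochs, and no optimizer step anywhere) would look like this:

```python
batch_size, epochs = 5, 5
per_batch_loss = []
for epoch in range(epochs):
    for i in range(0, len(X), batch_size):
        Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        per_batch_loss.append(cross_entropy_loss(Xb, yb, W, b))
        # No sgd_step(...) here, so W and b never change.

# Plotted, per_batch_loss is the flat red line: it jitters from batch
# to batch but never trends toward zero.
```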