
I am using a convolutional neural network (CNN). I save the best CNN model weights only at epochs where the validation accuracy improves over previous epochs.

Does increasing the number of epochs also increase over-fitting for CNNs and deep learning in general?

Ethan
user121

2 Answers


Yes, it may. In machine learning there is an approach called early stopping. In that approach you plot the error rate on the training and validation data, with the number of epochs on the horizontal axis and the error rate on the vertical axis, and you stop training at the point where the validation error is at its minimum. Consequently, if you keep increasing the number of epochs beyond that point, you will end up with an over-fitted model.
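
For example, in Keras this can be done with the EarlyStopping callback. A minimal sketch follows; the names model, x_train, y_train, x_val, and y_val are placeholders for your own CNN and data, not something from your code:

    # Minimal early-stopping sketch (assumes `model`, `x_train`, `y_train`,
    # `x_val`, `y_val` are already defined elsewhere).
    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(
        monitor="val_loss",         # watch the validation error
        patience=5,                 # tolerate a few noisy epochs before stopping
        restore_best_weights=True,  # roll back to the epoch with the lowest val_loss
    )

    history = model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=200,                 # upper bound; training stops earlier if val_loss stalls
        callbacks=[early_stop],
    )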

In the deep-learning era, early stopping is not so customary. There are different reasons for that, but one of them is that deep-learning approaches need a great deal of data, and the plotted curves would be very wavy because these approaches use stochastic-gradient-like optimization. In deep learning you may again end up with an over-fitted model if you train too long on the training data. To deal with this problem, other approaches are used to avoid it. Adding noise to different parts of the model, such as dropout, or batch normalization with a moderate batch size, helps these learning algorithms not to over-fit even after many epochs.
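
As a rough illustration, a small Keras CNN using dropout and batch normalization might look like the sketch below; the layer sizes, input shape, and dropout rate are arbitrary placeholders, not recommendations:

    # Sketch of a small CNN regularized with batch normalization and dropout.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),   # e.g. small RGB images (illustrative)
        layers.Conv2D(32, 3, activation="relu"),
        layers.BatchNormalization(),       # stabilizes activations, mild regularizing effect
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),               # randomly drops units to discourage memorization
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])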

In general, too many epochs may cause your model to over-fit the training data. It means that your model does not learn the data, it memorizes it. You have to track the accuracy on the validation data at each epoch (or maybe each iteration) to investigate whether it over-fits or not.
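
A quick way to do that is to plot the per-epoch curves from the history object returned by model.fit (assuming the Keras sketches above; with another framework you would log the same quantities yourself):

    # Plot training vs. validation accuracy per epoch from a Keras History object.
    import matplotlib.pyplot as plt

    plt.plot(history.history["accuracy"], label="training accuracy")
    plt.plot(history.history["val_accuracy"], label="validation accuracy")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
    # A widening gap between the two curves is the usual sign of over-fitting.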

gia huy
Green Falcon
  • I have a question. I was trying to understand overfitting in this case in terms of Runge's phenomenon in polynomial fitting. For the latter, we always assume that the fit is the best possible polynomial interpolant, and higher-order ones (with a uniform grid) lead to the undesired oscillation (overfitting). Now, does a high number of epochs get us closer to the real minimum of the loss function in its parameter space, in the sense that the problem is about the model itself rather than attaining its real minimum? I am kinda new to the topic, so sorry for a potentially stupid question. – gamebm Oct 24 '22 at 02:08
  • @gamebm The main reason we will have overfitting for a large number of epochs is that the weights will be large, and they will not have their initial small values. I mean this phenomenon is due to the fact that the weights will be biased towards values that are not moderate values. – Green Falcon Oct 24 '22 at 06:29
  • Thanks for the explanation. To my understanding, the weights are the matrix elements in the convolution filter, so they are essentially the model parameters in the linear section (of a nonlinear model where the nonlinear part is the crucial difference if the Runge phenomenon analogy is not appropriate). – gamebm Oct 24 '22 at 11:17
  • In other words, from an ML perspective, it is actually not favorable to attain the real global minimum of the loss function by exhaustively pushing the model to its extremum due to the potential "overfitting." While from the viewpoint of the Runge phenomenon, such an extremum (i.e., square minimum) for the polynomial fit was assumed in the first place. Please correct me if I misunderstand. Again, many thanks! – gamebm Oct 24 '22 at 11:18
  • @gamebm The point is that when we add terms like L1 to the cost function, it would be nice to find the global extremum because we consider regularisation in the cost function. The problem is that the cost is not convex for multi-layer networks. Consequently, we cannot find it. I mean in this case, it would be nice to find the global extremum, but we do not have any tool for that. – Green Falcon Oct 26 '22 at 06:08
  • Let me reiterate to confirm that I got it correctly now. For a loss function without any regularization, as the number of epochs increases, the weights might become too large, so overfitting takes place. In this case, one gets closer to the minimum of the loss function (except the loss function in question is not adequately designed). With the proper regularization (though "proper" might not be straightforward to achieve), we indeed strive for the global minimum (except it is never easy to do so for a complex network). Am I getting your point correctly? Thanks again! – gamebm Oct 26 '22 at 11:07
  • @gamebm Sorry, I tried to consider solely your previous comment, not the entire content. For all approaches, whether you use regularisation or not, if you increase the number of epochs, the weights will be larger than at the initial steps. If you add regularisation, the magnitude of the weights will be controlled, but the weights can still be large if you increase the number of epochs. On the other hand, the global minimum is not something that we can find for non-convex optimisation, which is what we have in deep learning. – Green Falcon Oct 27 '22 at 10:52
  • The other point is that I guess you're confused by two distinct terms. The global minimum does not have any relation to the number of epochs. If you have a simple regression problem with regularisation, you can have a cost function with a convex shape. So, you can find it without overfitting. I mean you can find the global min, and you may not have overfitting. Simply imagine some points that have linear behavior, and you're asked to fit a line. Overfitting depends on different aspects of the model as well. For the mentioned problem, if you use a multi-layer network, you'll not have – Green Falcon Oct 27 '22 at 10:56
  • a convex shape for your cost, and if you have an exaggerated regularisation to really avoid large weights, you will not have overfitting, but underfitting may be possible. I really tried to convey what I had in my mind :) – Green Falcon Oct 27 '22 at 10:58
  • I think I get your point, thanks! – gamebm Oct 27 '22 at 14:55

YES. Increasing the number of epochs can over-fit the CNN model. This usually happens because of a lack of training data or because the model is too complex, with millions of parameters. To handle this situation, the options are (a combined sketch follows the list):

  1. Come up with a simpler model with fewer parameters to learn.
  2. Add more data through augmentation.
  3. Add noise to dense or convolutional layers.
  4. Add dropout layers.
  5. Add L1 or L2 regularizers.
  6. Add early stopping.
  7. Check the model's accuracy on validation data.
  8. Early stopping will tell you the appropriate number of epochs without overfitting the model.
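
A minimal Keras sketch combining several of these remedies (all sizes, rates, and the L2 factor are illustrative assumptions, not tuned values):

    # Sketch: augmentation, noise, L2 regularization, dropout, and early stopping.
    from tensorflow.keras import layers, models, regularizers
    from tensorflow.keras.callbacks import EarlyStopping

    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.RandomFlip("horizontal"),      # 2. data augmentation (training only)
        layers.RandomRotation(0.1),
        layers.GaussianNoise(0.1),            # 3. noise on inputs, assuming they are scaled to [0, 1]
        layers.Conv2D(32, 3, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-4)),  # 5. L2 regularizer
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),                  # 4. dropout
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # 6.-8. stop when validation accuracy stops improving and keep the best weights
    early_stop = EarlyStopping(monitor="val_accuracy", patience=5,
                               restore_best_weights=True)
    # model.fit(x_train, y_train, validation_data=(x_val, y_val),
    #           epochs=100, callbacks=[early_stop])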