
Why is the cross-entropy loss for all training examples (or the training examples in a batch) averaged over the size of the training set (or the batch size)?

Why is it not just summed and used?

1 Answer


The sum obviously depends on the number of data points. It is still valid and often used (e.g. when comparable scales are needed in a custom compound loss), assuming the most popular implementations of minibatch learning. The main benefit of averaging is that it brings the loss to a uniform scale, independent of batch size. This allows (see the sketch after this list):

  • interpreting the loss, to a certain extent;
  • easily comparing models trained with different hyperparameters;
  • giving the last batch of the training set, which may be shorter than the others, the same contribution to the loss;
  • keeping the optimal amount of regularization and the learning rate much less dependent on batch size.
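For a concrete illustration, here is a minimal PyTorch sketch (the linear model, the shapes, and the batch size of 32 are arbitrary toy choices, not anything from the question). It compares `reduction='mean'` against `reduction='sum'` in `nn.CrossEntropyLoss`: with the sum, both the loss value and the gradient magnitudes grow linearly with the batch size, which is exactly what averaging avoids.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: a linear classifier over 10 classes on a batch of 32 examples.
model = nn.Linear(20, 10)
x = torch.randn(32, 20)          # inputs
y = torch.randint(0, 10, (32,))  # integer class labels

# 'mean' reduction (the common default): loss scale is independent of batch size.
mean_loss = nn.CrossEntropyLoss(reduction='mean')(model(x), y)
mean_loss.backward()
grad_mean = model.weight.grad.norm().item()

# 'sum' reduction: loss and gradients grow linearly with the batch size,
# so the effective step size changes whenever the batch size changes.
model.zero_grad()
sum_loss = nn.CrossEntropyLoss(reduction='sum')(model(x), y)
sum_loss.backward()
grad_sum = model.weight.grad.norm().item()

print(sum_loss.item() / mean_loss.item())  # ~32, i.e. the batch size
print(grad_sum / grad_mean)                # ~32: gradient scale follows batch size
```

Because the gradient magnitude under `'sum'` grows with the batch size, a learning rate tuned at one batch size effectively changes when the batch size changes; with `'mean'` the step size stays roughly comparable across batch sizes.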
dx2-66