I'm building an LSTM neural network for time series prediction (regression) and I'm training it with custom loss functions. I'm trying to determine which of the 3 cost functions gives the "best" model; in other words, I'm trying to define what "best" means.
The 3 cost functions yield values on different scales, and in addition 2 of them produce positive numbers while the third produces negative numbers.
I have 5 datasets and I train one model per (dataset, loss) pair, so with 3 losses I get 3 * 5 = 15 trained models and 15 validation losses in total (a schematic sketch of this setup follows the table). On validation data, the results look something like this:
|         | loss1 | loss2 | loss3  |
|---------|-------|-------|--------|
| data1   | 1.106 | 5.074 | -1.872 |
| data2   | 1.067 | 2.390 | -1.903 |
| data3   | 0.823 | 4.724 | -1.892 |
| data4   | 1.157 | 4.809 | -2.233 |
| data5   | 0.583 | 2.854 | -2.120 |
| Average | x     | x     | x      |
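For context, the training setup looks schematically like this (a sketch assuming Keras; the architecture, the losses, and the data below are synthetic placeholders, not my actual code):

```python
import numpy as np
import tensorflow as tf

def build_lstm(n_steps, n_features):
    # Stand-in for the actual architecture.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_steps, n_features)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])

# Synthetic stand-ins for the 5 datasets and the 3 custom losses.
rng = np.random.default_rng(0)
datasets = {f"data{i}": (rng.normal(size=(100, 10, 1)).astype("float32"),
                         rng.normal(size=(100, 1)).astype("float32"))
            for i in range(1, 6)}
losses = {"loss1": tf.keras.losses.MeanSquaredError(),
          "loss2": tf.keras.losses.MeanAbsoluteError(),
          "loss3": tf.keras.losses.LogCosh()}

# One model per (loss, dataset) pair: 3 * 5 = 15 trained models.
val_loss = {}
for loss_name, loss_fn in losses.items():
    for data_name, (X, y) in datasets.items():
        model = build_lstm(n_steps=10, n_features=1)
        model.compile(optimizer="adam", loss=loss_fn)
        hist = model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)
        val_loss[(loss_name, data_name)] = hist.history["val_loss"][-1]
```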
My goal is to compare how effective a model trained with loss1 is relative to a model trained with loss2 (and loss3) at predicting out-of-sample data.
One procedure I have tried is to standardise each loss across datasets, (loss - mean(loss)) / std(loss), then take an average across datasets for each loss and check which is smallest. However, with a small sample size of 5 I don't know whether this is valid. The average after standardising could be a simple average, a geometric average, or a harmonic average (a sketch of this computation appears below the second table). If I apply this method to the table above, I get these results:
|                   | loss1    | loss2    | loss3    |
|-------------------|----------|----------|----------|
| Simple average    | -3.9e-16 | -3.9e-16 | -9.7e-16 |
| Geometric average | 0.82     | 0.93     | 0.90     |
| Harmonic average  | 2.69     | 2.36     | 2.55     |
We can see that a different loss is considered best depending on the type of average chosen: for the simple average loss3 is best (although all three values are numerically zero, which is expected since standardising forces each column's mean to 0, so those differences are just floating-point noise), for the geometric average loss1 is best, and for the harmonic average loss2 is best... sort of confusing.
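Roughly, the computation looks like this (NumPy/SciPy sketch; geometric and harmonic means are undefined for negative inputs, so one possible choice, used here, is to take them over the absolute standardised values, which may not reproduce my table exactly):

```python
import numpy as np
from scipy.stats import gmean, hmean

# Validation losses from the table: rows = datasets, columns = loss1..loss3.
losses = np.array([
    [1.106, 5.074, -1.872],
    [1.067, 2.390, -1.903],
    [0.823, 4.724, -1.892],
    [1.157, 4.809, -2.233],
    [0.583, 2.854, -2.120],
])

# Standardise each loss (column) across the 5 datasets.
z = (losses - losses.mean(axis=0)) / losses.std(axis=0)

# Simple average: ~0 by construction, since standardising removes the mean.
print("simple:   ", z.mean(axis=0))

# Geometric/harmonic means need positive inputs, so use absolute values here.
print("geometric:", gmean(np.abs(z), axis=0))
print("harmonic: ", hmean(np.abs(z), axis=0))
```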
Is this a valid way to compare model performance across different loss functions, or is there another method that would be more suitable to determine the "best" model?
An alternative would be to look at a single accuracy metric across models (the same metric for all losses and datasets) and check which training loss yields the best value of it.
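For example (a sketch with synthetic stand-ins for the real out-of-sample predictions; RMSE is just one choice of common metric):

```python
import numpy as np

def rmse(y_true, y_pred):
    # One common metric applied to every model, regardless of training loss.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-ins for out-of-sample targets and each model's predictions.
rng = np.random.default_rng(0)
y_val = rng.normal(size=200)
preds = {
    "loss1 model": y_val + rng.normal(scale=0.3, size=200),
    "loss2 model": y_val + rng.normal(scale=0.5, size=200),
    "loss3 model": y_val + rng.normal(scale=0.4, size=200),
}

# All models are scored on one scale, so the values compare directly.
scores = {name: rmse(y_val, p) for name, p in preds.items()}
print(scores)
```

Since every model would be scored with the same metric, the per-dataset values could be averaged directly, without any standardising step.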