I'm building an LSTM neural network for time series prediction (regression) and I'm training it with custom loss functions. I'm trying to determine which of the 3 cost functions gives the "best" model; in other words, I'm trying to define what "best" means.
The 3 cost functions yield values on different scales, and in addition 2 of them produce positive numbers while the third produces negative numbers.
I have 5 datasets and I train one model per (dataset, loss) pair, so with 3 losses I get 3 * 5 = 15 trained models and 15 validation losses in total (a schematic sketch of this setup follows the table). On validation data, the results look something like this:
|         | loss1 | loss2 | loss3  |
|---------|-------|-------|--------|
| data1   | 1.106 | 5.074 | -1.872 |
| data2   | 1.067 | 2.390 | -1.903 |
| data3   | 0.823 | 4.724 | -1.892 |
| data4   | 1.157 | 4.809 | -2.233 |
| data5   | 0.583 | 2.854 | -2.120 |
| Average | x     | x     | x      |
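For context, the training setup looks schematically like this (a sketch assuming Keras; the architecture, the losses, and the data below are synthetic placeholders, not my actual code):

```python
import numpy as np
import tensorflow as tf

def build_lstm(n_steps, n_features):
    # Stand-in for the actual architecture.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_steps, n_features)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])

# Synthetic stand-ins for the 5 datasets and the 3 custom losses.
rng = np.random.default_rng(0)
datasets = {f"data{i}": (rng.normal(size=(100, 10, 1)).astype("float32"),
                         rng.normal(size=(100, 1)).astype("float32"))
            for i in range(1, 6)}
losses = {"loss1": tf.keras.losses.MeanSquaredError(),
          "loss2": tf.keras.losses.MeanAbsoluteError(),
          "loss3": tf.keras.losses.LogCosh()}

# One model per (loss, dataset) pair: 3 * 5 = 15 trained models.
val_loss = {}
for loss_name, loss_fn in losses.items():
    for data_name, (X, y) in datasets.items():
        model = build_lstm(n_steps=10, n_features=1)
        model.compile(optimizer="adam", loss=loss_fn)
        hist = model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)
        val_loss[(loss_name, data_name)] = hist.history["val_loss"][-1]
```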
My goal is to compare how effective a model trained with loss1 is relative to a model trained with loss2 (and loss3) at predicting out-of-sample data.
One procedure I have tried is to standardise each loss across datasets, (loss - mean(loss)) / std(loss), then take an average across datasets for each loss and check which is smallest. However, with a small sample size of 5 I don't know whether this is valid. The average after standardising could be a simple average, a geometric average, or a harmonic average (a sketch of this computation appears below the second table). If I apply this method to the table above, I get these results:
|                   | loss1    | loss2    | loss3    |
|-------------------|----------|----------|----------|
| Simple average    | -3.9e-16 | -3.9e-16 | -9.7e-16 |
| Geometric average | 0.82     | 0.93     | 0.90     |
| Harmonic average  | 2.69     | 2.36     | 2.55     |
We can see that a different loss is considered best depending on the type of average chosen: for the simple average loss3 is best (although all three values are numerically zero, which is expected since standardising forces each column's mean to 0, so those differences are just floating-point noise), for the geometric average loss1 is best, and for the harmonic average loss2 is best... sort of confusing.
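Roughly, the computation looks like this (NumPy/SciPy sketch; geometric and harmonic means are undefined for negative inputs, so one possible choice, used here, is to take them over the absolute standardised values, which may not reproduce my table exactly):

```python
import numpy as np
from scipy.stats import gmean, hmean

# Validation losses from the table: rows = datasets, columns = loss1..loss3.
losses = np.array([
    [1.106, 5.074, -1.872],
    [1.067, 2.390, -1.903],
    [0.823, 4.724, -1.892],
    [1.157, 4.809, -2.233],
    [0.583, 2.854, -2.120],
])

# Standardise each loss (column) across the 5 datasets.
z = (losses - losses.mean(axis=0)) / losses.std(axis=0)

# Simple average: ~0 by construction, since standardising removes the mean.
print("simple:   ", z.mean(axis=0))

# Geometric/harmonic means need positive inputs, so use absolute values here.
print("geometric:", gmean(np.abs(z), axis=0))
print("harmonic: ", hmean(np.abs(z), axis=0))
```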
Is this a valid way to compare model performance across different loss functions, or is there another method that would be more suitable to determine the "best" model?
An alternative would be to look at a single accuracy metric across models (the same metric for all losses and datasets) and check which training loss yields the best value of it.
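For example (a sketch with synthetic stand-ins for the real out-of-sample predictions; RMSE is just one choice of common metric):

```python
import numpy as np

def rmse(y_true, y_pred):
    # One common metric applied to every model, regardless of training loss.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-ins for out-of-sample targets and each model's predictions.
rng = np.random.default_rng(0)
y_val = rng.normal(size=200)
preds = {
    "loss1 model": y_val + rng.normal(scale=0.3, size=200),
    "loss2 model": y_val + rng.normal(scale=0.5, size=200),
    "loss3 model": y_val + rng.normal(scale=0.4, size=200),
}

# All models are scored on one scale, so the values compare directly.
scores = {name: rmse(y_val, p) for name, p in preds.items()}
print(scores)
```

Since every model would be scored with the same metric, the per-dataset values could be averaged directly, without any standardising step.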