As already mentioned, data leakage and having some of the same data in both the test and training sets can be problematic.
Other things that can go wrong:
Concept drift
The statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways.
This can happen even if you do everything right during training, and it's especially relevant if you're training on old data.
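As one illustration, a simple (and far from exhaustive) way to flag potential drift is to compare the distribution of a feature, or of the model's prediction scores, at training time against recent production data. The sketch below uses a two-sample Kolmogorov-Smirnov test; the data and the 0.01 threshold are made-up placeholders.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder data: scores at training time vs. scores on recent production data.
rng = np.random.default_rng(0)
training_scores = rng.normal(loc=0.0, scale=1.0, size=5000)
recent_scores = rng.normal(loc=0.3, scale=1.0, size=5000)  # shifted distribution

# A two-sample Kolmogorov-Smirnov test flags whether the two samples look like
# they come from different distributions; the threshold here is arbitrary.
statistic, p_value = ks_2samp(training_scores, recent_scores)
if p_value < 0.01:
    print(f"Possible drift detected (KS statistic = {statistic:.3f})")
```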
"Seeing the future"
Say you have some sort of time series, where current data and patterns inform future data (stock prices or customer behaviour, for example). If you trained on data that is in the future relative to the test set, that doesn't reflect the real world (where you can't see the future), and the test performance may be misleadingly optimistic.
Every example in the test set needs to come after every example in the training set, as in the sketch below.
This is related to concept drift (and may or may not count as concept drift, depending on how you define it).
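A minimal sketch of a chronological split, assuming the data sits in a pandas DataFrame with a timestamp column (the column names and the 80/20 split point are just placeholders):

```python
import pandas as pd

# Placeholder data standing in for real, timestamped records.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="D"),
    "feature": range(100),
    "target": [i % 2 for i in range(100)],
})
df = df.sort_values("timestamp")

# Everything before the split point is training data, everything after is test
# data, so the model never sees examples from the "future" during training.
split_index = int(len(df) * 0.8)
train = df.iloc[:split_index]
test = df.iloc[split_index:]
assert train["timestamp"].max() < test["timestamp"].min()
```

For time-series cross-validation, scikit-learn's TimeSeriesSplit enforces the same ordering constraint across folds.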
Optimising on the test set
If you're doing hyperparameter tuning, you should use a separate validation set (sometimes called a cross-validation set) for that.
Optimising your hyperparameters on the test set effectively means you're training on the test set, and then you can't trust your model's measured performance, for the same reason you can't trust a model evaluated on its own training data.
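A minimal scikit-learn sketch of this split of responsibilities; the model, parameter grid and toy data are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for the real problem.
X, y = make_classification(n_samples=2000, random_state=0)

# Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters are tuned via cross-validation on the training portion only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# The test set is used exactly once, for the final, unbiased estimate.
print("test accuracy:", search.score(X_test, y_test))
```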
Wrong metric?
I assume this may not be relevant given the way the question was phrased, but it's important to mention nonetheless.
If you didn't take appropriate care when selecting your evaluation metric, you may have a model that looks good on paper but would have quite a bad impact on the business if it actually got those same results in production.
There are many, many ways you can pick the wrong metric, but the general idea behind picking the right one is that you need to consider the real-world effects of the various ways in which the model can be wrong (false positives and false negatives for binary classification) and what the side effects of those errors may be (e.g. rejecting a valid customer means they're much less likely to come back and may convince others to go elsewhere instead).
Don't just use some popular metric out of the box without fully understanding the implications of using that metric.
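One way to make this concrete is to score models against an explicit cost model rather than a generic metric. The costs below are entirely made up for illustration; the point is that two models with identical accuracy can differ a lot in business impact.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical costs: here a false positive (rejecting a valid customer) is
# assumed to cost more than a false negative (missing one fraudulent transaction).
COST_FALSE_POSITIVE = 50.0
COST_FALSE_NEGATIVE = 10.0

def business_cost(y_true, y_pred):
    """Total cost of a set of predictions under the assumed cost model."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

# Two models with identical accuracy but very different business cost.
y_true = np.array([0, 0, 0, 0, 1, 1])
model_a = np.array([0, 0, 1, 1, 1, 1])  # two false positives
model_b = np.array([0, 0, 0, 0, 0, 0])  # two false negatives
print(business_cost(y_true, model_a))  # 100.0
print(business_cost(y_true, model_b))  # 20.0
```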
How can you mitigate the risks?
This is a large topic all by itself, but I can briefly address it here.
Apart from what I already mentioned above, you can do:
Incremental roll-out
For example, you can start running the model in a few cities rather than rolling it out worldwide right away.
This limits the risk in case the model turns out not to perform as well in the real world.
A/B testing
This allows you to accurately evaluate a model in a way you may not otherwise be able to (e.g. if you reject a transaction flagged as fraud, you may never know whether it actually was fraud).
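A minimal sketch of deterministic traffic assignment for an A/B test; the function name, the hashing scheme and the 10% treatment fraction are assumptions for illustration:

```python
import hashlib

def assign_variant(user_id: str, treatment_fraction: float = 0.1) -> str:
    """Deterministically assign a user to the new model or the current one."""
    # Hashing the user id means the same user always sees the same variant.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000
    return "new_model" if bucket < treatment_fraction else "current_model"

print(assign_variant("customer-42"))
```

Keeping the assignment deterministic means each user's outcomes can be consistently attributed to one model over the whole test.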
Appropriate monitoring of model performance
Keep an eye on how the model is performing in production using various metrics and make sure there aren't any unforeseen side effects.
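A small sketch of what that monitoring might look like, assuming you log predictions and eventually obtain the true labels; the column names, the weekly window and the alert threshold are all placeholders:

```python
import numpy as np
import pandas as pd

# Placeholder log of production predictions joined with the eventual ground truth.
rng = np.random.default_rng(0)
logs = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=120, freq="D"),
    "prediction": rng.integers(0, 2, size=120),
    "actual": rng.integers(0, 2, size=120),
})
logs["correct"] = (logs["prediction"] == logs["actual"]).astype(int)

# Track accuracy per week and flag any week that drops below the agreed threshold.
weekly_accuracy = logs.set_index("timestamp")["correct"].resample("W").mean()
ALERT_THRESHOLD = 0.45  # arbitrary value for illustration
for week, accuracy in weekly_accuracy.items():
    if accuracy < ALERT_THRESHOLD:
        print(f"Week ending {week.date()}: accuracy {accuracy:.2f} is below threshold")
```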
Regular retraining of the model
This is most relevant for concept drift. Generally you would expect model performance to degrade over time, so retraining on more recent data helps keep the model up to date.