
Suppose I have a binary classifier $f$ which acts on an input $x$. Given a threshold $t$, the predicted binary output is defined as: $$ \widehat{y} = \begin{cases} 1, & f(x) \geq t \\ 0, & f(x) < t \end{cases} $$ I then compute the $TPR$ (true positive rate) and $FPR$ (false positive rate) metrics on the hold-out test set (call it $S_1$):

  • $TPR_{S_1} = \Pr(\widehat{y} = 1 | y = 1, S_1)$
  • $FPR_{S_1} = \Pr(\widehat{y} = 1 | y = 0, S_1)$
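
For concreteness, here is a minimal sketch of how I compute these rates on a labelled set (the names are just illustrative; `scores` holds $f(x)$ for each sample and `y_true` holds the 0/1 labels):

```python
import numpy as np

def tpr_fpr(scores, y_true, t):
    """TPR and FPR of the thresholded classifier y_hat = 1[f(x) >= t]."""
    scores, y_true = np.asarray(scores), np.asarray(y_true)
    y_hat = (scores >= t).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    tpr = tp / (tp + fn)    # Pr(y_hat = 1 | y = 1)
    fpr = fp / (fp + tn)    # Pr(y_hat = 1 | y = 0)
    return tpr, fpr
```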

Next, say I deploy the classifier to production (i.e. acting on real-world data), and after two weeks I collect the results and have the data labelled (by a human, assume no errors in labelling). Call this data set $S_2$. I observe the following:

  • total samples during this period = $|S_2|$ = $N$ negative + $P$ positive
  • $TPR_{S_2} = \Pr(\widehat{y} = 1 | y = 1, S_2)$
  • $FPR_{S_2} = \Pr(\widehat{y} = 1 | y = 0, S_2)$

My question is this:

Under what conditions / assumptions can I meaningfully compare the target TPR and FPR (as computed on the hold-out set $S_1$) to the observed TPR and FPR (as computed on the production data set $S_2$)? Or, at the very least, is there a relation between the TPR, FPR on $S_1$ and the TPR, FPR on $S_2$, and does it even make sense to compare them?

My intuition is that the input distributions in $S_1$ and $S_2$ should be similar, but I need some help formalizing this concept.

Any tips and literature suggestions are greatly appreciated!


2 Answers


Your points totally make sense, as it is key to monitor model performance in an MLOps strategy. About these points, I would say:

  • the metric you might want to monitor to measure your model's performance is ROC AUC (which is built from your TPR and FPR)
  • yes, it makes sense to compare the ROC AUC from the training phase (on your holdout set S1) versus new ROC AUC values for new predictions in production (of course, once you have the true labels to validate them); see the sketch below
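
As a rough sketch of this comparison (the data here is a made-up placeholder; in practice you would plug in your own scores and labels from S1 and the labelled production batch S2):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_s1, scores_s1: true labels and model scores f(x) on the holdout set S1
# y_s2, scores_s2: human labels and model scores on the production batch S2
# (random placeholders here, only to make the snippet runnable)
rng = np.random.default_rng(0)
y_s1, scores_s1 = rng.integers(0, 2, 1000), rng.random(1000)
y_s2, scores_s2 = rng.integers(0, 2, 2000), rng.random(2000)

auc_s1 = roc_auc_score(y_s1, scores_s1)
auc_s2 = roc_auc_score(y_s2, scores_s2)

# flag degradation relative to the holdout value; the 10% cut-off is just a choice
relative_drop = (auc_s1 - auc_s2) / auc_s1
if relative_drop > 0.10:
    print(f"AUC dropped {relative_drop:.1%} vs. holdout: consider drift checks / retraining")
```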

About monitoring when a model should be retrained or might be underperforming:

  • check that your input distributions are still the same between training and inference in production; this is known as data drift, and can be detected via the Population Stability Index (link) or with statistical tests like the Kolmogorov-Smirnov test; this Google link discusses data drift detection on univariate and multivariate inputs (see the sketch after this list)
  • it is also interesting to track possible concept drift (i.e. a change in the relationship between the inputs and the target variable)
  • check your metric value (ROC AUC in this case) and decide on a threshold below which you consider your model to be underperforming (e.g. is your production model's metric 10% worse than its value at training time?)
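
As an illustration, here is a minimal sketch of both drift checks on a single input feature; the `psi` helper and the synthetic data are my own illustrative additions (not a particular library's implementation), while the KS test uses `scipy.stats.ks_2samp`:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, n_bins=10):
    """Population Stability Index of one feature: training-time sample
    ("expected") vs. production sample ("actual"), using quantile bins."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # cover values outside the training range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)         # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# placeholder feature samples; in practice use one input feature from S1 and S2
rng = np.random.default_rng(0)
x_s1 = rng.normal(0.0, 1.0, 5000)
x_s2 = rng.normal(0.3, 1.2, 5000)                # shifted on purpose to simulate drift

print("PSI:", psi(x_s1, x_s2))                   # rule of thumb: > 0.2 is often flagged
print("KS :", ks_2samp(x_s1, x_s2))              # two-sample Kolmogorov-Smirnov test
```
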
German C M
  • "yes, it makes sense to compare this ROC AUC value between training phase VerSus new ROC AUC values for new predictions on production" -- is this generally true? if so, why? Aren't there any assumptions that need to be made? – Alexandru Dinu Apr 28 '22 at 10:42
  • the general rule to keep in mind when monitoring your model's performance is that you need a metric (ROC AUC in this case, for example) to evaluate whether it is performing as well as you consider good enough. Metric degradation could be due to a change in the conditions of your population, something you can detect via data drift monitoring; also, the sample sizes should be large enough to be compared – German C M Apr 28 '22 at 13:18
  • I understand that you need a metric. What I am asking is whether (or when) the metric computed on the test set is worth comparing to the metric computed on the real-world data. Example scenario: "FPR on real-world data is 4x higher than FPR on the test set." Does this automatically imply the model is _bad_? Maybe the class ratios differ, maybe some examples were under- or over-represented, etc. One explanation may indeed be _concept drift_ (as you mention) -- how can this be further formalized? – Alexandru Dinu Apr 28 '22 at 13:48
  • I'll look further into **concept drift**. Thank you! – Alexandru Dinu Apr 28 '22 at 13:55

You can run A/B tests on the TPRs and FPRs to test your statement that they come from the same distribution. The null hypothesis is that the true positives (and false positives) obtained from the S1 and S2 results come from the same distribution. A simple t-test could be a good start (or something like this z-test if you have a lot of samples).
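
For instance, a sketch of this comparison for the TPRs as a two-proportion z-test (the counts below are made up for illustration; `statsmodels` provides `proportions_ztest`, and the same applies to the FPRs using the negatives):

```python
from statsmodels.stats.proportion import proportions_ztest

tp_s1, pos_s1 = 180, 200    # true positives / total positives in the holdout set S1 (made up)
tp_s2, pos_s2 = 400, 500    # true positives / total positives in the production set S2 (made up)

# H0: the positives in S1 and S2 are Bernoulli trials with the same success
# probability, i.e. TPR_S1 = TPR_S2
z_stat, p_value = proportions_ztest(count=[tp_s1, tp_s2], nobs=[pos_s1, pos_s2])
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```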

Aramakus
  • What's a "sample" in this context? I only have 2 TPRs and 2 FPRs (on S1 and S2), how can an A/B test be constructed? – Alexandru Dinu Apr 28 '22 at 10:39
  • Samples would be the number of positives in S1 and S2. Next, assume that in each of the samples an observation is a TP or FN as the result of a Bernoulli process with probabilities TPR1 and TPR2 respectively. You test the plausibility of the observed estimates satisfying TPR1 = TPR2 (what you observe are estimates of these values coming from limited-size samples). – Aramakus Apr 28 '22 at 10:48