
I have a binary classification model (XGBoost) that predicts whether a customer will purchase a service.

Overall the metrics are satisfactory: ~0.67 AUC, ~30% precision, and ~40% recall at max F1. Performance holds up well out of sample and out of time.

The overall proportion of the positive class is 0.13 (~13%).

However, something makes me uneasy: the top 2,700 scores (out of 150K) are 100% class 1, which may suggest some sort of target leakage from the label back into the features.

Is there a binomial test of some sort to check the likelihood of an abnormal situation?

Mouad_S
  • What do you mean by *top* when you say "the top 2700 scores"? It can't be the prediction scores, because that would not make any sense. – jdsurya Nov 08 '21 at 04:21
  • It looks like your instances are not randomized. You could indeed perform a significance test, but given the low proportion of the positive class (based on the performance scores), it's clear that this is unlikely to happen by chance. – Erwan Nov 08 '21 at 21:07
  • @JdSuryaP Yes, that's what I mean: the prediction scores. – Mouad_S Nov 09 '21 at 14:11
  • @Erwan I made an edit to indicate that the overall rate of the positive class is 13%. – Mouad_S Nov 09 '21 at 14:14
  • Sorry, I didn't read the question carefully the first time; it's not an issue about randomization. I'm going to share a few thoughts in an answer. – Erwan Nov 09 '21 at 16:51

1 Answer


This looks totally normal to me.

At the end of the day, the predicted score (or probability) is supposed to represent the likelihood of an instance being positive, so one expects the proportion of positive instances among the top predicted scores to be as high as possible.
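For instance, a quick way to see how purity decays as you move down the ranking is to compute the precision among the top-k scores at several cutoffs. A minimal sketch, with stand-in random data in place of your real labels and scores (the names `y_true` and `scores` are hypothetical; substitute your own arrays):

```python
# Minimal sketch: precision among the top-k predicted scores at several
# cutoffs. Stand-in random data; substitute your own y_true and scores.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(150_000)                        # model scores (stand-in)
y_true = (rng.random(150_000) < 0.13).astype(int)   # 13% positive base rate

order = np.argsort(-scores)                         # indices, descending by score
for k in (1_000, 2_700, 10_000, 50_000):
    print(f"precision@{k}: {y_true[order[:k]].mean():.3f}")
```

With a working model you should see precision near 1 at the very top and a smooth decay toward the 13% base rate as k grows.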

In particular, the dataset may contain instances that are easy to classify correctly as positive, so the model logically captures such patterns and assigns high scores to those instances. This might even be caused by a few features for which some specific values directly imply a positive instance, but that does not necessarily mean there is any data leakage: if this information is "legitimately" available in the features, there is no reason for the model not to use it. So the only real question is whether the task was designed properly, and usually this cannot be deduced from the data alone.
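If you suspect that a handful of features drive those top scores, one rough way to look is to compare feature distributions inside and outside the top-scored group, then manually inspect the features with the largest gaps for plausibility. A minimal sketch, with toy data standing in for a hypothetical feature DataFrame `X` aligned with a score array `scores`:

```python
# Minimal sketch: which features look most different inside the top-scored
# group? Toy stand-in data; replace X and scores with your own.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(150_000, 5)),
                 columns=[f"f{i}" for i in range(5)])
scores = rng.random(150_000)

mask = np.zeros(len(X), dtype=bool)
mask[np.argsort(-scores)[:2_700]] = True   # flag the top 2,700 scores

# Large mean gaps flag candidate features that may single-handedly
# identify positives (whether legitimately or through leakage).
gap = (X[mask].mean() - X[~mask].mean()).abs().sort_values(ascending=False)
print(gap)
```

A large gap alone does not prove leakage; it just tells you which features to reason about when judging whether the task was set up properly.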

> Is there a binomial test of some sort to check the likelihood of an abnormal situation?

There are tests to check the likelihood of this happening by chance, but they wouldn't make sense in this case: by definition, the scores predicted by the classifier are not random (at least they shouldn't be), so a significance test would trivially reject the null hypothesis. This wouldn't prove any data leakage either; it's exactly what we expect from the classifier. In other words, if the scores given by the classifier were truly random, the classifier would not be doing its job and its performance would be terrible.
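For illustration, here is what such a test would look like with the numbers from the question, a minimal sketch using `scipy.stats.binomtest`:

```python
# Minimal sketch: exact binomial test of observing 2,700 positives out of
# the top 2,700 scores if that group were a random sample at the 13% base rate.
from scipy.stats import binomtest

result = binomtest(k=2700, n=2700, p=0.13, alternative="greater")
print(result.pvalue)  # 0.13**2700 underflows to 0.0: chance is ruled out,
                      # which is exactly what a working classifier implies
```

The null hypothesis (the top group is a random draw from the population) is rejected overwhelmingly, but that only confirms the ranking is informative; it says nothing about whether the information is legitimate or leaked.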

Erwan