
I've created a model that has recently started suffering from drift.

I believe the drift is due to changes in the dataset but I don't know how to show that quantitatively.

What techniques are typically used to analyze and explain model (data) drift?

Extra:

The data is Tabular.

Connor

3 Answers


It depends on what type of data we are talking about: tabular, image, text...

This is part of my PhD, so I am completely biased, but I will suggest Explanation Shift (I would love some feedback). It works well on tabular data.

In the related work section one can find other approaches.

The main idea behind "Explanation Shift" is to see how distribution shift impacts the model's behaviour. To do this, we compare how the explanations (Shapley values) look on the test set versus the supposed Out-Of-Distribution (OOD) data.

The issue is that in the absence of labels for the OOD data (y_ood), one cannot estimate the performance of the model. One either needs some samples of y_ood or a characterization of the type of shift. Since you can't calculate performance metrics, the second-best option is to understand how the model has changed.
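
Here is a minimal, self-contained sketch of the underlying idea (not the actual package, all names and data are made up for illustration). It relies on the fact that for a linear model with independent features, the Shapley value of feature i is coef_i * (x_i - E[x_i]), so we can compare the distribution of these contributions on the test set versus the suspected OOD set:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Fit a simple model on in-distribution training data.
X_tr = rng.normal(size=(1000, 3))
y_tr = X_tr @ np.array([1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
model = LinearRegression().fit(X_tr, y_tr)

# For a linear model with independent features, the Shapley value of
# feature i is coef_i * (x_i - E[x_i]).
def explanations(X):
    return model.coef_ * (X - X_tr.mean(axis=0))

X_test = rng.normal(size=(500, 3))   # same distribution as training
X_ood = rng.normal(size=(500, 3))
X_ood[:, 0] += 2.0                   # simulate a shift in feature 0

# Compare explanation distributions per feature with a two-sample KS test.
for j in range(X_tr.shape[1]):
    stat, p = ks_2samp(explanations(X_test)[:, j], explanations(X_ood)[:, j])
    print(f"feature {j}: KS statistic = {stat:.2f}, p-value = {p:.3g}")
```

In practice you would compute actual Shapley values with a package such as `shap`, but the comparison logic stays the same.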

There is also the well-known library Alibi Detect (https://github.com/SeldonIO/alibi-detect) that has other methods :)

Carlos Mougan
  • Very cool! Thank you, I'll read that today. Can you add a brief description of your explanation shift method? We're talking tabular data; I'll update the question! – Connor Mar 17 '23 at 07:44
  • I already have comments! Is it possible to add them to arXiv? – Connor Mar 17 '23 at 07:52
  • How about a GitHub Issue? Thanks :) It is part of my research, thank you! https://github.com/nobias-project/nobias – Carlos Mougan Mar 17 '23 at 08:11
  • If you want to know more about dataset shift, you might want to check this book: https://cs.nyu.edu/~roweis/papers/invar-chapter.pdf – Carlos Mougan Mar 17 '23 at 08:12
  • Or this video: https://www.youtube.com/watch?v=WhpZKIra-FQ&t=457s&ab_channel=ColumbiaDataScienceInstitute – Carlos Mougan Mar 17 '23 at 08:12

One way to start is fundamental exploratory data analysis.

Compare univariate, bivariate, and multivariate distributions between training data and new data. Those comparisons can be done visually, qualitatively, and quantitatively.

The exact methods would depend on the data type of the features of the tabular dataset. One specific example is the K-L divergence between two continuous distributions.
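
As a concrete illustration of that example, a rough K-L estimate for a single numeric feature can be obtained by histogramming both samples on a shared grid (the feature values and shift below are synthetic):

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(42)
train_feat = rng.normal(loc=0.0, scale=1.0, size=10_000)   # training data
new_feat = rng.normal(loc=0.5, scale=1.0, size=10_000)     # drifted data

# Discretize both samples on a shared grid and compare the histograms.
bins = np.histogram_bin_edges(np.concatenate([train_feat, new_feat]), bins=30)
p, _ = np.histogram(train_feat, bins=bins, density=True)
q, _ = np.histogram(new_feat, bins=bins, density=True)

eps = 1e-10                        # avoid log(0) in empty bins
kl = entropy(p + eps, q + eps)     # KL(P || Q) in nats; entropy() normalizes
print(f"K-L divergence: {kl:.3f}")
```

Note that K-L is asymmetric and unbounded; symmetric alternatives such as Jensen-Shannon divergence are often easier to threshold.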

Brian Spiering
  1. Choose the drift detection method. There are different statistical tests (e.g., Kolmogorov–Smirnov, Chi-square) and distance/divergence measures (e.g., Wasserstein distance, K-L divergence, Jensen-Shannon distance, Population Stability Index) that can be used to compare distributions of tabular data. Some methods are better for numerical features, some for categorical, and some work for both.

If you want to understand how different drift detection methods behave, here is a blog post (with code) on an experiment that applied different drift detection methods to artificially shifted datasets: https://www.evidentlyai.com/blog/data-drift-detection-large-datasets

  2. Define comparison windows that match your use case and expected feature behavior: for example, you can compare your last week of data to the previous week, or all production data to validation data, or to some golden set, etc.

  3. Evaluate prediction drift. It makes sense to look separately at the distribution shift in the model predictions (output drift), as it is often a great indicator that something has changed, e.g., the model is predicting certain categories more often.

  4. Evaluate feature drift. You can perform per-feature drift detection, measure the share of drifted features, and visually explore the drifted ones to interpret what changed, or only test the top model features for drift.
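
The per-feature drift step can be sketched as follows, using a two-sample KS test per column (the dataset, column names, and the injected shift are all made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Reference (training-time) data vs. current (production) data.
reference = pd.DataFrame({
    "age": rng.normal(40, 10, 2000),
    "income": rng.lognormal(10.0, 1.0, 2000),
    "tenure": rng.exponential(5.0, 2000),
})
current = pd.DataFrame({
    "age": rng.normal(40, 10, 2000),           # unchanged
    "income": rng.lognormal(10.3, 1.0, 2000),  # drifted
    "tenure": rng.exponential(5.0, 2000),      # unchanged
})

alpha = 0.05
drifted = []
for col in reference.columns:
    stat, p = ks_2samp(reference[col], current[col])
    if p < alpha:
        drifted.append(col)
    print(f"{col}: KS = {stat:.3f}, p = {p:.4f}")

print(f"{len(drifted)}/{len(reference.columns)} features drifted: {drifted}")
```

With large samples, statistical tests flag even tiny shifts, so distance-based measures with a threshold are often preferred at scale.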

You can use open-source libraries like Evidently https://github.com/evidentlyai/evidently that implement many drift detection methods and can quickly visualize distributions.
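
Library APIs evolve, so rather than pin specific calls, here is a from-scratch sketch of one of the methods listed above, the Population Stability Index (the thresholds in the docstring are a common rule of thumb, not a standard):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index of `actual` relative to `expected`.

    Common rule of thumb: < 0.1 no significant shift,
    0.1-0.25 moderate shift, > 0.25 major shift.
    """
    # Bin edges are quantiles of the reference sample.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip so current values outside the reference range land in edge bins.
    clipped = np.clip(actual, edges[0], edges[-1])
    a_pct = np.histogram(clipped, bins=edges)[0] / len(actual)
    eps = 1e-6  # guard against empty bins
    return float(np.sum((a_pct - e_pct) * np.log((a_pct + eps) / (e_pct + eps))))

rng = np.random.default_rng(7)
ref = rng.normal(0, 1, 5000)
print(f"no shift:   PSI = {psi(ref, rng.normal(0, 1, 5000)):.3f}")
print(f"mean shift: PSI = {psi(ref, rng.normal(0.5, 1, 5000)):.3f}")
```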

mllena