
Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method for tackling imbalanced datasets. There are many highly cited papers out there claiming that it boosts accuracy in unbalanced data scenarios.

But when I look at Kaggle competitions, it is rarely used; to the best of my knowledge, there are no prize-winning Kaggle/ML competitions where it was used to achieve the best solution. Why is SMOTE not used on Kaggle?

I even see applied research papers (where millions of dollars are at stake) in which SMOTE is not used: Practical Lessons from Predicting Clicks on Ads at Facebook.

Is this because it's not the best strategy possible? Is it a research niche with no real-life application? Is there any ML competition with a high reward where this was used to achieve the best solution?

I guess I am just hesitant to believe that creating synthetic data actually helps.

Carlos Mougan
  • [I contend that class imbalance isn't a problem and that artificial balancing approaches like SMOTE are not needed to solve a non-problem.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) // [Frank Harrell, founding chairman of biostatistics at Vanderbilt, has tweeted about this.](https://twitter.com/f2harrell/status/1062424969366462473) – Dave Dec 27 '21 at 17:57
  • https://twitter.com/JFPuget/status/1475769513480179717 – Carlos Mougan Dec 29 '21 at 19:17

4 Answers


After some debate on social networks and much asking around (see the Twitter thread), the best answer that I can find is that it does not work.

I would love to retract this answer upon seeing a real-life example where it actually works (see JFPuget's tweet).

A recap of other sources and social media is in the comments below.

Carlos Mougan
  • [It seems that SMOTE might not even be any good at synthesizing new points!](https://stats.stackexchange.com/q/585173/247274) – Dave Nov 17 '22 at 21:59
  • Also see [The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression (van den Goorbergh et al. 2022)](https://pubmed.ncbi.nlm.nih.gov/35686364/) – Eike P. Aug 31 '23 at 09:03
  • I prefer this one from Abhishek: https://www.kaggle.com/competitions/amazon-employee-access-challenge/discussion/5086 – Lucas Morin Sep 01 '23 at 19:58

SMOTE improves prediction performance in a few limited situations; in general, however, it does not work. See https://arxiv.org/abs/2201.08528

Yotam

Somehow I missed this question... I can answer it, as I have spent a great part of the last few years dealing with imbalance inside and outside of Kaggle.

Why doesn't SMOTE work?

SMOTE creates synthetic data points, which is roughly equivalent to oversampling (if you take points A and B and create (A+B)/2, it is roughly equivalent to weighting A and B by 1.5 each). The added complexity of the generation process doesn't seem to bring consistent performance gains.
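To make that concrete, here is a minimal NumPy sketch (my own illustration, not taken from any library) of the interpolation step at the core of SMOTE: pick a minority point, pick one of its minority-class neighbors, and sample a new point on the segment between them.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_interpolate(a, b):
    # New point on the segment between minority sample `a` and one of
    # its minority-class nearest neighbors `b`; lam = 0.5 would give
    # exactly the midpoint (A+B)/2 mentioned above.
    lam = rng.uniform(0.0, 1.0)
    return a + lam * (b - a)

A = np.array([1.0, 2.0])
B = np.array([3.0, 4.0])
print(smote_interpolate(A, B))  # some point between A and B
```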

I even think there might be some sort of 'no free lunch' theorem at work here: since you neither add nor remove information, sometimes the new points are helpful, sometimes they are counterproductive, and the average impact should be zero.

Now the question is: does oversampling work? It seems we have a decent answer over here: https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he (roughly: no). I think that generally answers the question of the usefulness of SMOTE.

There are times when undersampling/oversampling is actually needed for engineering reasons (undersampling reduces data size, and hence memory usage; oversampling ensures you have positive examples in each mini-batch when training a neural network, and it seems to accelerate convergence).
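To illustrate the mini-batch point, here is a minimal PyTorch sketch (the toy data and numbers are my own, illustrative choices) that oversamples the rare class with a WeightedRandomSampler so that positives show up in every batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced data: 990 negatives, 10 positives (illustrative only).
X = torch.randn(1000, 8)
y = torch.cat([torch.zeros(990), torch.ones(10)]).long()

# Weight each sample inversely to its class frequency, so both classes
# are drawn at roughly equal rates.
class_counts = torch.bincount(y)
weights = 1.0 / class_counts[y].double()

sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=64, sampler=sampler)

xb, yb = next(iter(loader))
print(yb.float().mean())  # ~0.5: every batch now contains positives
```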

Why is it popular despite not being used on Kaggle?

Regarding Kaggle, I remember reading a winning solution that mentioned SMOTE and noted it only had a marginal effect, but I can't manage to find the discussion post; it might have been a minor/custom competition. I recently checked the blog posts, and there is indeed no mention of SMOTE. So why has SMOTE gained so much traction despite not being effective on Kaggle?

  • Bad research practices (a.k.a. 'publish or perish') that lead only to the diffusion of apparent improvements: marginally new techniques, non-robust increments in performance, etc., without application to any real-life data set. Some of these practices can be found in business too.
  • Bad influencer practices (you know, those LinkedIn people sharing half-baked TDS articles), driven by visibility instead of quality. Unfortunately, this sort of behavior is also present on Kaggle forums (and even encouraged by the medal system).
  • Bad interviewing practices: somehow it has become an interview question and appears on interview question lists. Now you have both non-technical people asking about it and young data scientists learning to answer 'SMOTE' when asked about data imbalance. A context where truth matters less than confidence.
  • Bad ML practices: with wrong metrics and a poorly designed CV, it is very easy to gain apparent performance, because SMOTE is very leaky (it gets easier to predict B from A when you add (A+B)/2 to your fold; see the sketch after this list). It is then easier to publish with a jump in performance and put another coin in the hype machine.
  • Bad evaluation practices: even with a good metric, if you add a bit of noise to your data you have roughly a 50% chance of improving performance. It is often really easy (and often incentivised) to be led by non-robust evaluation, both inside and outside of Kaggle (in business and in research). It is quite rare to see discussion of robust evaluation of performance increments, even on Kaggle.
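To illustrate the leakage point in the list above, here is a sketch, assuming scikit-learn and imbalanced-learn are available (the data and model are toy choices of mine). It contrasts the leaky setup, where SMOTE runs before cross-validation, with the safer one, where SMOTE is re-fit inside each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
clf = LogisticRegression(max_iter=1000)

# Leaky: resampling before the CV split. Synthetic points derived from a
# sample can land in the validation fold while the sample itself is in
# the training fold, inflating the score.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
leaky = cross_val_score(clf, X_res, y_res, cv=5, scoring="roc_auc")

# Safer: SMOTE inside the pipeline only ever sees the training folds.
pipe = Pipeline([("smote", SMOTE(random_state=0)), ("clf", clf)])
honest = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")

print(leaky.mean(), honest.mean())  # the leaky estimate is typically higher
```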

Last but not least: Kagglers know a far better alternative. It should not be a surprise, but Kaggle is very much into GBDTs, which appear to be quite effective at handling imbalance. Notably, since each leaf of each tree covers a set of examples, you usually get good probability calibration. A little bit of regularisation and you are good to go.
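As a minimal sketch of that alternative, assuming LightGBM (the dataset and hyperparameter values are illustrative, not recommendations): train a lightly regularised GBDT directly on the imbalanced data, with no resampling at all.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data with a ~3% positive rate.
X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    min_child_samples=50,  # a bit of regularisation on leaf size
    reg_lambda=1.0,
)
model.fit(X_tr, y_tr)

# No SMOTE, no reweighting: raw probabilities straight from the trees.
print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```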

One major caveat to all of this: tabular competitions seem to be getting rarer over time (outside of TPS), and (maybe because organisers figured out it is not a problem) there hasn't been a significantly imbalanced one in a while.

Cherry on top: you might want to watch this EuroSciPy video from last week, in which G. Lemaitre from scikit-learn shows how to work with an imbalanced dataset. It includes a demonstration of how SMOTE doesn't work. He also mentions the imbalanced-learn package, which, as one of its creators, he advises not to use (putting ADASYN and Tomek links in the same garbage bin). He ends by talking about the real problem: cost imbalance matters more for business than target imbalance.
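To make the cost-imbalance point concrete: with calibrated probabilities, a (hypothetical) cost matrix fixes the decision threshold directly, and no resampling is involved. A small sketch:

```python
# Hypothetical costs: a missed positive (false negative) hurts 50x more
# than a false alarm (false positive).
cost_fp, cost_fn = 1.0, 50.0

# Flag when the expected cost of ignoring exceeds the expected cost of
# acting, i.e. when p * cost_fn > (1 - p) * cost_fp.
threshold = cost_fp / (cost_fp + cost_fn)
print(threshold)  # ~0.0196: act on quite small calibrated probabilities

def decide(p):
    return p > threshold
```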

Lucas Morin

I think this is a highly interesting topic that has been around for a long time with, as we can see, no clear conclusion. From my applied experience, I would use under-/over-sampling techniques when:


  • we know in advance that our dataset has an unrealistic ratio between positive and negative target labels; with under-/over-sampling we can rebalance towards a more realistic ratio (see the recalibration sketch below)

  • our dataset has incorrect data samples, so we need to filter out (e.g. by undersampling) wrong data points (systematic errors introduced when retrieving the raw data) that we are unlikely to encounter in a real inference scenario

  • the positives are known to happen (as in fraud detection) but we have not yet had time to collect enough positive samples, so oversampling could be interesting

These cases aim to correct an unreliable training dataset, making it more similar to the real scenario. I think this is the reason not to apply under-/over-sampling techniques in real cases (or simply when the dataset is correctly built). My understanding is that on Kaggle the datasets are already well designed and aimed directly at modeling. So, when an imbalanced dataset represents the real problem distribution, it is the algorithm's responsibility to capture the pattern of the data as is.
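One caveat if you do resample for the first reason above: the model's predicted probabilities then reflect the artificial ratio, not the real one, and need recalibrating. A small sketch of the standard correction after keeping each negative with probability w (this kind of recalibration is described in the Facebook ads paper cited in the question):

```python
def recalibrate(p, w):
    # Map a probability `p` from a model trained on data whose negatives
    # were downsampled at rate `w` back to the original class prior.
    return p / (p + (1.0 - p) / w)

# A model trained after keeping 10% of negatives predicts 0.5; on the
# true distribution that corresponds to roughly 0.09.
print(recalibrate(0.5, 0.1))
```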

German C M
  • Even if I could agree with your answer, I don't see how it answers the SMOTE question. – Carlos Mougan Dec 27 '21 at 18:04
  • I don't think the words "incorrect" or "unrealistic" capture the message you are trying to share. – Carlos Mougan Dec 27 '21 at 18:04
  • unrealistic ratio = does not represent the real ratio you will find in the long-term inference scenario / incorrect = due to a data collection process with some systematic error – German C M Dec 27 '21 at 18:36
  • "I understand in Kaggle the datasets are already well designed and aimed directly for modeling" Thats simply not true. Often the datasets come with all kind of issues, and addressing those issues is how you win a competition. – Christof Henkel Aug 31 '23 at 09:14