
I asked this question on the Statistics SE, but there were no answers, even when a modest bounty was offered, so I am asking here to see if any examples can be given.

I have been looking into the imbalanced learning problem, where a classifier is often expected to be unduly biased in favour of the majority class. However, I am having difficulty identifying datasets where class imbalance is genuinely a problem and, furthermore, where it is actually a problem, whether it can be fixed by re-sampling (e.g. SMOTE) or re-weighting the data.

Can anyone give reproducible examples of real-world (preferably not synthetic) datasets where re-sampling or re-weighting can be used to improve the accuracy (or, equivalently, the misclassification error rate) of some particular classifier system (when applied in accordance with best practice)? This must be an improvement in accuracy on the original data distribution, not the resampled one, as it is the original distribution that reflects the operational conditions where the classifier will be deployed.

I am only interested in accuracy as the performance measure. There are some tasks where accuracy is the quantity of interest in the application, so I would appreciate it if there were no digressions onto the topic of proper scoring rules, or other performance measures.

It is not an example of the class imbalance problem if the operational class frequencies are different to those in the training set or the misclassification costs are not equal. Cost-sensitive learning is a different issue.

UPDATE: While the answer that received the bounty was not ideal (as it didn't appear to apply the classifier in accordance with best practice), I may well give a new bounty to answers that more fully address the question.

Dikran Marsupial
  • It seems to me that your question implies that someone claimed class balancing improves accuracy in imbalanced setups. However, the usual claim is a different one: that accuracy misrepresents the performance of classification algorithms on imbalanced data and that it gives misleading expectations that can be harmful in some cases where the performance regarding the minority class is important for some reason. Just saying. – noe Apr 18 '23 at 19:42
  • @noe accuracy doesn't misrepresent performance in imbalanced problems. The needs of the application determine the choice of performance metric, and sometimes that is accuracy. The key there is to provide the correct context, by also giving the accuracy of the classifier that assigns everything to the majority class, or by using Cohen's kappa, which is a rescaled accuracy (see the sketch after these comments). I think it is unlikely that such an example can be demonstrated, but am asking the question out of self-skepticism - I am happy to be proven wrong. – Dikran Marsupial Apr 19 '23 at 06:42
  • Note if you can improve accuracy by rebalancing, you ought to be able to improve the empirical loss for unequal misclassification costs, as resampling is one way of implementing cost-sensitive learning. A lot of questions on the stats SE seem to assume that class imbalance is an inherent problem in classification and that you are supposed to rebalance the data as part of the "pipeline". I think this is due to a mistaken perception that rebalancing improves performance. – Dikran Marsupial Apr 19 '23 at 06:44
  • I have one example where some kind of resampling was needed, but it was about making sure each batch had one example of both classes when tuning neural nets. I empirically found that it accelerated training too; however, careful conclusions would require retuning all the hyperparameters. – Lucas Morin Apr 21 '23 at 13:16
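
To make the baseline-and-kappa reporting mentioned in the comments concrete, here is a minimal sketch; the dataset is synthetic and purely illustrative (it is not an answer to the question itself, which asks for real-world data):

# Sketch: report a model's accuracy alongside the majority-class baseline
# and Cohen's kappa (a rescaled accuracy). Synthetic data, illustration only.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

# a 95%/5% imbalanced toy problem
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("majority-class accuracy:", accuracy_score(y_test, majority.predict(X_test)))
print("model accuracy:         ", accuracy_score(y_test, model.predict(X_test)))
print("model Cohen's kappa:    ", cohen_kappa_score(y_test, model.predict(X_test)))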

3 Answers


From my experience with real-world data, I have never seen domains in which resampling techniques consistently improve a model's performance. Remember that one of the main assumptions of learning is that the training data are generated by the same system as the test data, i.e. they both have the same distribution, which no longer holds when resampling is applied.

Instead, I would go for cost-sensitive learning, so that misclassifications of the minority class are penalised more heavily (a short sketch of this alternative follows the code below).

I'm sharing an example of a dataset on which SMOTE showed a slight increase across different metrics.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
import urllib.request


# Load the dataset from a URL
url = 'https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv'
filename = 'creditcard.csv'
urllib.request.urlretrieve(url, filename)

# Load the dataset into a Pandas dataframe
df = pd.read_csv(filename)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('Class', axis=1), df['Class'], test_size=0.3, random_state=42)

# Perform SMOTE oversampling on the training set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train a classifier on the resampled training set
clf = LGBMClassifier(random_state=42).fit(X_train_resampled, y_train_resampled)

# Evaluate the classifier on the original testing set
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

# Train a classifier on the original training set
clf = LGBMClassifier(random_state=42).fit(X_train, y_train)

# Evaluate the classifier on the original testing set
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
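
A minimal sketch of the cost-sensitive alternative mentioned above, reusing the train/test split from the code above (class_weight='balanced' is just one way of setting the costs in LightGBM):

# Cost-sensitive variant: re-weight the loss instead of resampling.
# class_weight='balanced' weights classes inversely to their frequencies.
clf_weighted = LGBMClassifier(class_weight='balanced', random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf_weighted.predict(X_test)))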

Hope it helps!

Multivac
  • Thanks for the answer - I am definitely in agreement with the first two paragraphs! I tried the example, but for me it gives the same accuracy in both cases (1.00). – Dikran Marsupial Apr 21 '23 at 07:06
  • @DikranMarsupial The accuracies are both very high and round to 1.00. Adding `print((y_test == y_pred).mean())` to each section gives `0.99898` for resampled and `0.99890` for original. Whether that's a random effect or not should be examined. – Ben Reiniger Apr 21 '23 at 16:51
  • @BenReiniger I tried about a dozen random_state values and they all seemed to give a better accuracy for SMOTE, which is a very interesting result. Some give bigger differences (a value of 2 gives 0.9990 and 0.9932). It could be that SMOTE is acting as a regulariser by "blurring" the input data, I'll have to investigate LGBM more to see if there are some model tuning steps that are required. – Dikran Marsupial Apr 21 '23 at 17:27
  • Ah, I see it has regularisation parameters which are not being used (their default value is zero), so it is questionable whether it is being used in accordance with best practice. – Dikran Marsupial Apr 21 '23 at 17:28

A well-known example is the Breast Cancer Wisconsin Data Set, with a target-variable imbalance of 63%/37% (in the version published on Kaggle). There is a plethora of research out there which uses techniques such as SMOTE to improve accuracy.

There are also a lot of Kaggle notebooks which do the same, which you would easily be able to run yourself. Just looking through a couple, this notebook would be an example which shows how SMOTE improves the accuracy of XGBoost on this dataset. I have not verified the quality of this notebook.

Note that you stated you wanted a dataset "where it is actually a problem". This is highly subjective. On this dataset, with the notebook linked, increasing accuracy by 3% seems significant enough for a healthcare application that I would accept it as showing that class imbalance is a problem. However, this is completely arbitrary.
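
For reference, here is a minimal sketch of an evaluation protocol that would meet the question's requirements on this dataset: SMOTE applied only to the training data, accuracy measured on an untouched test set. It uses scikit-learn's built-in copy of the data and a logistic regression, which are my assumptions rather than the notebook's choices:

# Sketch: SMOTE on the training set only; accuracy measured on the
# original (untouched) test distribution. Model choice is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
smoted = make_pipeline(StandardScaler(), SMOTE(random_state=0),
                       LogisticRegression(max_iter=1000))

baseline.fit(X_train, y_train)
smoted.fit(X_train, y_train)

print("accuracy without SMOTE:", accuracy_score(y_test, baseline.predict(X_test)))
print("accuracy with SMOTE:   ", accuracy_score(y_test, smoted.predict(X_test)))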

  • AFAICS that example is incorrect, as it looks as if it is evaluating the accuracy on the SMOTEd dataset, not the original data distribution. I have updated the question to make that point clearer. – Dikran Marsupial Apr 20 '23 at 14:03
  • Unfortunately there are a *lot* of incorrect examples available on-line, and I think that has contributed to a lot of the misunderstandings about imbalanced learning - I am asking this question to find out whether there is a really solid example that supports the idea that resampling improved accuracy. – Dikran Marsupial Apr 20 '23 at 14:40

One example of real-world imbalanced data is credit card fraud.

Here is code showing empirically better performance for SMOTE:

import imblearn
from imblearn.pipeline import make_pipeline
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Load data and split (assumes creditcard.csv has already been downloaded,
# e.g. via the snippet in the other answer)
data = pd.read_csv("creditcard.csv", header=1).values  # header=1 skips the first row; see comments below
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Without SMOTE
lr = LogisticRegression(solver='liblinear', class_weight=None)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))

The balanced accuracy for non-SMOTE is ~ 0.774.

# With SMOTE
pipe = make_pipeline(imblearn.over_sampling.SMOTE(),
                     LogisticRegression(solver='liblinear', class_weight=None))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred))

The balanced accuracy for SMOTE is ~0.935.

Addendum:

It is often possible to maximize regular accuracy for an imbalanced dataset by always predicting the majority class (not fitting a machine learning model or resampling).

import numpy as np
from sklearn.metrics import accuracy_score

# Always predicting majority class
y_pred = np.zeros(len(y_test))
print(accuracy_score(y_test, y_pred))

The regular accuracy for always predicting the majority class is ~0.998.
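
For the metric the question actually asks about, here is a sketch of plain accuracy on the original test distribution for the two models fitted above:

# Plain accuracy on the original (imbalanced) test distribution,
# for the models fitted in the snippets above.
print(accuracy_score(y_test, lr.predict(X_test)))    # without SMOTE
print(accuracy_score(y_test, pipe.predict(X_test)))  # with SMOTE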

Ben Reiniger
Brian Spiering
  • Hi, thanks for the answer, but that seems to be looking at balanced accuracy, rather than the accuracy on the original data distribution. – Dikran Marsupial Apr 20 '23 at 16:30
  • Balanced accuracy is more useful as an evaluation metric than regular accuracy for imbalanced data. If I want to maximize regular accuracy on the original data distribution, I can use a fixed model that only predicts the majority class (aka, not fraud). There is no need for resampling or machine learning if the goal is to maximize regular accuracy in extremely imbalanced data. – Brian Spiering Apr 20 '23 at 16:57
  • balanced accuracy may be a more useful metric for some problems, but that was not the question that was posed (it is essentially a cost-sensitive learning issue). The choice of performance metric depends on the requirements of the application, and for some problems, even where there is imbalance, accuracy is completely appropriate. I fully agree that imbalance is not a reason for resampling the data, but I see a lot of SE questions and blog posts that suggest not everybody agrees. I'm happy to be proven wrong! – Dikran Marsupial Apr 20 '23 at 17:06
  • Should always have baselines for comparison/context - predicting the majority class is a good one, if nothing else it is cheap! What is the accuracy for logistic regression in this case (I am a MATLAB person rather than python)? – Dikran Marsupial Apr 20 '23 at 18:20
  • header=1 should remove the labels and allow for training. On the theoretical side, I switched to predict_proba and AUC as the evaluation metric (to avoid accuracy and thresholding problems) + random shuffle-split cross-validation, and the result seems to hold. However, I haven't completely ruled out time leakage... – Lucas Morin Apr 21 '23 at 13:35
  • @lcrmorin I want to focus on accuracy because in some applications a decision must be made, and equal misclassification costs is reasonable for some applications. Basically the choice of metric depends on the needs of the application, not the properties of the dataset (such as imbalance). Sometimes assigning everything to the majority class is exactly the right thing to do, in which case the performance metric needs to reflect that. – Dikran Marsupial Apr 21 '23 at 13:47
  • Thanks @lcrmorin, header=1 did indeed fix the problem. The accuracy without SMOTE is 0.9989887924496503 and with SMOTE it is 0.9863486980702789 (to a ridiculously large number of d.p. ;o) – Dikran Marsupial Apr 21 '23 at 13:51