
I'm working on a binary classification problem (two classes, 0 and 1). My training dataset has 8,161 samples, distributed as follows:

  • class 0: 6,008 samples (73.6% of the total)
  • class 1: 2,153 samples (26.4%)

My questions are:

  • In this case, should I consider my dataset imbalanced?

  • If so, should I preprocess the data before using RandomForest to make a prediction?

  • If not, could somebody tell me in which situation (i.e. at what class ratio) a dataset should be considered imbalanced?

Oxbowerce
ouyqf
    As a heads up, class imbalance almost certainly is not a problem. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Oct 26 '21 at 09:38
  • https://datascience.stackexchange.com/a/114111/82468 – GooJ Sep 10 '22 at 16:12

3 Answers


Intuitively, a ~75/25 ratio of class labels does look like an imbalanced dataset.

If you want to look at it more formally, you can run a hypothesis test. With a sample size of 8,161, take as the null hypothesis that the classes are split 50/50, compute the probability of observing a count as extreme as 6,008 or more in one class (the p-value), and reject the null hypothesis if the p-value is low (below 0.05 or 0.01, as per your choice).

This can be done using a binomial distribution.
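As a minimal sketch, this test can be run with SciPy's exact binomial test (assuming SciPy >= 1.7, where `binomtest` is available; the counts are taken from the question):

```python
# Exact binomial test: H0 = the two classes are split 50/50.
from scipy.stats import binomtest

n_total = 8161   # total training samples
n_class0 = 6008  # majority-class count

# P(X >= 6008) under Binomial(n=8161, p=0.5)
result = binomtest(n_class0, n_total, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.3e}")
```

The p-value is vanishingly small, so the 50/50 null is rejected; note that this only tells you the split differs from 50/50, not that the imbalance is harmful for modeling.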

Hithesh Kk

You can try ydata-profiling (https://github.com/ydataai/ydata-profiling). It has a property that measures whether a class is imbalanced based on entropy, which might be helpful.

https://github.com/ydataai/ydata-profiling/blob/master/src/ydata_profiling/model/pandas/imbalance_pandas.py

The concept behind validating class imbalance is pretty straightforward: on a dataset of n instances with k classes of sizes C_1, ..., C_k, you can compute the Shannon entropy

    H = -Σ_{i=1}^{k} (C_i / n) · log(C_i / n)

Its maximum, log k, is reached when the classes are perfectly balanced, so the closer H is to log k, the more balanced the dataset.

It is one of the most precise metrics I've found for validating whether a dataset is imbalanced, since Shannon entropy is commonly used to measure the impurity or uncertainty within a set of data.
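A minimal sketch of that computation in plain Python (assuming base-2 logarithms and normalization by log2 k; the exact ydata-profiling implementation may differ in details):

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Shannon entropy of the class distribution, normalized to [0, 1].

    1.0 means perfectly balanced classes; values near 0 mean one class dominates.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(k)

# Class counts from the question: 6,008 vs 2,153.
labels = [0] * 6008 + [1] * 2153
print(round(normalized_entropy(labels), 3))  # ≈ 0.83
```

On the question's 73.6/26.4 split this gives roughly 0.83, noticeably below the 1.0 of a balanced dataset but far from the near-zero values of extreme imbalance.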

FabC
  • Do you mind pointing us to which property it is, specifically? There are quite a lot of things in this project. – lpounng May 16 '23 at 10:00
  • Just updated the details :) – FabC May 22 '23 at 01:09

I think you can speak of imbalanced targets whenever (in the case of a binary classification problem) the classes are not represented in a 50:50 manner. This is almost always the case.

With about 25/75 in your case, I would see this as "imbalanced". There are some strategies to deal with this problem, such as (re)sampling so that you achieve a 50:50 balanced sample (with undersampling you will essentially lose observations from the majority class). Alternatively, you can use synthetic oversampling (SMOTE) and related techniques.

However, some packages come with built-in options to deal with unbalanced targets, e.g. sklearn's random forest (the class_weight option). Check the docs.
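A minimal sketch of that option, on a synthetic dataset with roughly the question's 75/25 split (the dataset itself is made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary dataset with roughly a 75/25 class split.
X, y = make_classification(
    n_samples=8161, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# class_weight="balanced" weights samples inversely to class frequency,
# so the minority class contributes as much to the fit as the majority.
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

This avoids discarding or fabricating samples: the imbalance is handled by reweighting inside the model rather than by resampling the data.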

Peter
  • This is bad advice. You should never over-sample on unbalanced datasets; only under-sample the over-represented class. – GooJ Sep 10 '22 at 16:11