Questions about classifiers or classification problems in which some classes in the data are under-represented.
Questions tagged [class-imbalance]
542 questions
59 votes, 6 answers
Should I go for a 'balanced' dataset or a 'representative' dataset?
My machine learning task is to separate benign Internet traffic from malicious traffic. In a real-world scenario, most (say 90% or more) Internet traffic is benign. I therefore felt that I should choose a similar data setup for training my…
pnp
40 votes, 6 answers
Unbalanced multiclass data with XGBoost
I have 3 classes with this distribution:
Class 0: 0.1169
Class 1: 0.7668
Class 2: 0.1163
I am using XGBoost for classification. I know there is a parameter called scale_pos_weight.
But how is it handled in the multiclass case, and how can…
shda
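A minimal sketch for the XGBoost multiclass question above, under assumptions not taken from the thread: because scale_pos_weight targets binary problems, one commonly suggested alternative is to weight each training row by the inverse frequency of its class and pass the result as per-sample weights (for example through the sample_weight argument of XGBClassifier.fit). The helper name inverse_frequency_weights and the toy label list are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per sample: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[y] for y in labels]


# Toy labels roughly matching the 12% / 77% / 12% split in the question
# (illustrative only, not the asker's data).
labels = [0] * 12 + [1] * 76 + [2] * 12
weights = inverse_frequency_weights(labels)

# With these weights every class contributes the same total weight.
totals = {
    c: sum(w for w, y in zip(weights, labels) if y == c)
    for c in set(labels)
}
```

The weights list has one entry per row, so it can be handed to any trainer that accepts per-sample weights.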
35 votes, 4 answers
Quick guide into training highly imbalanced data sets
I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set, so the data set is quite unbalanced. A plain random forest simply marks all test samples as the majority class.
Some good answers…
IgorS
27 votes, 4 answers
macro average and weighted average meaning in classification_report
I use classification_report (from sklearn.metrics import classification_report) to evaluate an imbalanced binary classification
Classification Report :
precision recall f1-score support
0 1.00…
user10296606
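As a rough illustration of the two averages asked about in the classification_report question above: the macro average is the unweighted mean of the per-class scores, while the weighted average weights each class score by its support. The function name macro_and_weighted and the numbers below are hypothetical, not values from the asker's report.

```python
def macro_and_weighted(scores, supports):
    """Return (macro average, support-weighted average) of per-class scores."""
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted


# Hypothetical per-class F1 scores and supports for an imbalanced binary task.
f1_per_class = [1.0, 0.5]
support = [990, 10]

macro_f1, weighted_f1 = macro_and_weighted(f1_per_class, support)
```

Here macro_f1 is 0.75 and weighted_f1 is 0.995: the weighted average is dominated by the majority class, while the macro average exposes the weak minority-class score.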
25 votes, 3 answers
How do you apply SMOTE on text classification?
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique used for imbalanced datasets. I have an idea of how to apply it to generic, structured data, but is it possible to apply it to a text classification problem?…
catris25
20 votes, 4 answers
Train/Test Split after performing SMOTE
I am dealing with a highly unbalanced dataset so I used SMOTE to resample it.
After SMOTE resampling, I split the resampled dataset into training/test sets using the training set to build a model and the test set to evaluate it.
However, I am…
Edamame
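A hedged sketch of the ordering issue raised in the SMOTE question above: resampling before the split can place copies (or near-copies) of the same minority rows in both sets, so a usual recommendation is to split first and resample only the training portion. To stay dependency-free, this sketch swaps SMOTE for plain random oversampling with replacement; the function name split_then_oversample and the toy labels are hypothetical.

```python
import random
from collections import Counter


def split_then_oversample(labels, test_fraction=0.2, seed=0):
    """Split row indices first, then oversample minority classes in train only."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Group the training indices by class.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Duplicate rows at random until each class matches the largest one.
    target = max(len(idx) for idx in by_class.values())
    balanced_train = []
    for idx in by_class.values():
        balanced_train.extend(idx)
        balanced_train.extend(rng.choices(idx, k=target - len(idx)))
    return balanced_train, test_idx


labels = [1] * 10 + [0] * 90  # toy data: 10 positives, 90 negatives
train_idx, test_idx = split_then_oversample(labels)
train_counts = Counter(labels[i] for i in train_idx)
```

The test indices are never touched by the resampling step, so the evaluation set keeps the original class distribution.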
19 votes, 4 answers
Macro- or micro-average for imbalanced class problems
The question of whether to use macro- or micro-averages when the data is imbalanced comes up all the time.
Some googling shows that many bloggers tend to say that micro-average is the preferred way to go, e.g.:
Micro-average is preferable if there…
Krrr
17 votes, 3 answers
When should we consider a dataset as imbalanced?
I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced.
My question is, are there any rules of thumb that tell us when we should subsample the large category in order to force some kind of balancing in…
Rami
16 votes, 4 answers
What are the implications for training a Tree Ensemble with highly biased datasets?
I have a highly biased binary dataset - I have 1000x more examples of the negative class than the positive class. I would like to train a Tree Ensemble (like Extra Random Trees or a Random Forest) on this data but it's difficult to create training…
gallamine
15 votes, 2 answers
Why do we need to handle data imbalance?
I would like to know why we need to deal with data imbalance. I know how to deal with it and the different methods for doing so: upsampling, downsampling, or SMOTE.
For example, if I have a rare disease affecting 1 in 100, and…
sara
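A small worked example of why the imbalance in the question above matters, assuming 1 positive case in 100 as in its rare-disease scenario: a model that always predicts "negative" scores high accuracy while finding no positive cases at all. The variable names are illustrative.

```python
# 1 positive case in 100, following the question's rare-disease example.
labels = [1] + [0] * 99

# A trivial "classifier" that always predicts the majority (negative) class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = true_positives / sum(labels)
```

Accuracy comes out at 0.99 and recall at 0.0, which is why accuracy alone is usually considered a poor guide on imbalanced data.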
13 votes, 3 answers
Unbalanced classes -- How to minimize false negatives?
I have a dataset that has a binary class attribute. There are 623 instances with class +1 (cancer positive) and 101,671 instances with class -1 (cancer negative).
I've tried various algorithms (Naive Bayes, Random Forest, AODE, C4.5) and all of them…
user798275
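One option often raised for the false-negatives question above, sketched here under assumptions: if the classifier outputs scores, lowering the decision threshold trades more false positives for fewer false negatives. The helper threshold_for_recall, the toy scores, and the 1/0 label encoding (instead of the question's +1/-1) are all hypothetical.

```python
import math


def threshold_for_recall(scores, labels, target_recall):
    """Largest threshold t such that 'score >= t' reaches the target recall."""
    positive_scores = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    # Number of positives that must be caught to meet the target.
    needed = max(math.ceil(target_recall * len(positive_scores)), 1)
    return positive_scores[needed - 1]


# Toy classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0]

threshold = threshold_for_recall(scores, labels, target_recall=1.0)
predictions = [1 if s >= threshold else 0 for s in scores]
false_negatives = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
false_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
```

In this toy case the threshold drops to 0.2, which removes all false negatives at the cost of two false positives. In practice the threshold would be chosen on held-out data, not on the training set.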
13 votes, 6 answers
Deep network not able to learn imbalanced data beyond the dominant class
I have data with 5 output classes. The training data has the following number of samples for these 5 classes:
[706326, 32211, 2856, 3050, 901]
I am using the following keras (tf.keras) code:
class_weights =…
dbm
13 votes, 1 answer
Why doesn't class weight resolve the imbalanced classification problem?
I know that in imbalanced classification, the classifier tends to predict the majority class for all test samples, but if we use class weights in the loss function, it would be reasonable to expect the problem to be solved. So why do we need some…
user137927
12 votes, 1 answer
Cross validation for highly imbalanced data with undersampling
In my problem, I am dealing with a highly imbalanced data set, with, say, 10,000 negative examples for every positive one. A common starting point for training a model is to undersample the data. In this procedure, it is very important to train our…
Amin Kiany
12 votes, 3 answers
How can I perform stratified sampling for multi-label multi-class classification?
I am asking this question for a few reasons:
The dataset at hand is imbalanced
I used the code below:
x = dataset[['Message']]
y = dataset[['Label1', 'Label2']]
train_data, test_data = train_test_split(x, test_size = 0.1, stratify=y, random_state =…
Divyanshu Shekhar
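A dependency-free sketch of one approach to the multi-label stratification question above, offered as an assumption and not as the thread's accepted answer: treat each combination of the two labels as a single stratum and split every stratum separately. The function name stratified_split_on_label_pairs and the toy label pairs are hypothetical, and combinations with very few rows end up entirely in the training set.

```python
import random
from collections import Counter, defaultdict


def stratified_split_on_label_pairs(label_pairs, test_fraction=0.1, seed=0):
    """Split row indices so each (Label1, Label2) combination keeps its share."""
    rng = random.Random(seed)

    # Group row indices by their label combination.
    groups = defaultdict(list)
    for i, pair in enumerate(label_pairs):
        groups[tuple(pair)].append(i)

    train_idx, test_idx = [], []
    for idx in groups.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Toy (Label1, Label2) pairs with an imbalanced mix of combinations.
label_pairs = [("a", "x")] * 50 + [("a", "y")] * 30 + [("b", "x")] * 20
train_idx, test_idx = stratified_split_on_label_pairs(label_pairs)
test_counts = Counter(label_pairs[i] for i in test_idx)
```

With a 10% test fraction, each combination contributes a tenth of its rows to the test set, so the test split mirrors the joint label distribution.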