Questions tagged [smote]

Synthetic Minority Oversampling Technique (SMOTE) is an approach for dealing with imbalanced datasets before running them through machine learning models. Common techniques for handling class imbalance include oversampling the minority class or undersampling the majority class. As its name suggests, SMOTE oversamples the minority class, generating synthetic examples rather than duplicating existing ones. SMOTE can thereby create more balanced datasets that are less dominated by the majority class.
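
As a minimal illustration of the technique, here is a sketch using the imbalanced-learn (imblearn) package on a synthetic 90:10 toy dataset:

    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Toy dataset: roughly 90% majority class, 10% minority class.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
    print(Counter(y))

    # SMOTE synthesises new minority samples by interpolating between a
    # minority point and one of its k nearest minority-class neighbours.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    print(Counter(y_res))  # both classes now equally represented
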

99 questions
25 votes · 3 answers

How do you apply SMOTE on text classification?

Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique used for imbalanced dataset problems. So far I have an idea of how to apply it to generic, structured data. But is it possible to apply it to a text classification problem?…
catris25 (369)
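
The pattern the answers converge on is to vectorise the text first and apply SMOTE to the resulting numeric matrix; a minimal sketch, assuming TF-IDF features and the imblearn package (the toy documents and labels are invented):

    from imblearn.over_sampling import SMOTE
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["win money now", "cheap pills online", "hello old friend",
            "meeting at noon", "see you tomorrow", "lunch next week"]
    labels = [1, 1, 0, 0, 0, 0]  # 2 spam vs 4 ham: imbalanced

    # SMOTE needs numeric features, so vectorise the raw text first.
    X = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix

    # k_neighbors must be smaller than the minority class size (here 2).
    X_res, y_res = SMOTE(k_neighbors=1, random_state=0).fit_resample(X, labels)

Note that the synthetic rows are points in TF-IDF space, not readable sentences, which is the main caveat the answers raise.
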
20 votes · 4 answers

Train/Test Split after performing SMOTE

I am dealing with a highly unbalanced dataset, so I used SMOTE to resample it. After SMOTE resampling, I split the resampled dataset into training/test sets, using the training set to build a model and the test set to evaluate it. However, I am…
Edamame (2,705)
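
The consensus in the answers is to split first, so that the test set contains only real, untouched samples; a sketch with imblearn and a toy dataset:

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

    # Split FIRST: if SMOTE runs before the split, synthetic points built
    # from training rows leak into the test set and inflate the scores.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.25, random_state=0)

    # Oversample only the training portion.
    X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

    model = RandomForestClassifier(random_state=0).fit(X_train_res, y_train_res)
    print(model.score(X_test, y_test))  # evaluated on real, imbalanced data
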
11 votes · 2 answers

Oversampling/Undersampling only the train set or both the train and validation sets

I am working on a dataset with a class imbalance problem. Now, I know one needs to oversample or undersample only the train set and not the test set. But my issue is: whether to oversample the train set and then split it into train and validation sets, or…
yamini goel (711)
11 votes · 4 answers

Why is SMOTE not used in prize-winning Kaggle solutions?

Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method to tackle imbalanced datasets. There are many highly cited papers out there claiming that it is used to boost accuracy in unbalanced data scenarios. But then, when I…
Carlos Mougan (6,011)
9 votes · 1 answer

Why you shouldn't upsample before cross-validation

I have an imbalanced dataset and I am trying different methods to address the data imbalance. I found this article that explains the correct way to cross-validate when oversampling data with the SMOTE technique. I have created a model using AdaBoost…
sums22 (407)
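
The article's point is usually demonstrated with imblearn's Pipeline, which refits SMOTE inside each fold and applies it to the training folds only; a sketch (AdaBoost as in the question, toy data):

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline  # imblearn's Pipeline, not sklearn's
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Upsampling before cross_val_score would let synthetic copies of a
    # point sit in the validation fold while the original is trained on.
    # Inside the pipeline, SMOTE is re-fit on the training folds only.
    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("clf", AdaBoostClassifier(random_state=0))])
    print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())
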
8 votes · 1 answer

What is the best performance metric when balancing a dataset using the SMOTE technique?

I used the SMOTE technique to oversample my dataset, and now I have a balanced dataset. The problem I face is that the performance metrics (precision, recall, F1 measure, accuracy) on the imbalanced dataset are better than with the balanced…
Rawia Sammout (199)
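
Whatever metric is chosen, it has to be computed on an untouched, still-imbalanced test set; a sketch of such an evaluation, reporting per-class precision/recall/F1 and PR-AUC instead of plain accuracy (toy data, logistic regression as a stand-in model):

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score, classification_report
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Train on the oversampled data, but always score on the real test set:
    # accuracy is misleading at 90:10, so look at the minority-class row of
    # the report and at a ranking metric such as average precision (PR-AUC).
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)

    print(classification_report(y_te, model.predict(X_te)))
    print(average_precision_score(y_te, model.predict_proba(X_te)[:, 1]))
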
7 votes · 2 answers

Why is class weight outperforming oversampling?

I am applying both the class_weight and oversampling (SMOTE) techniques to a multiclass classification problem and getting better results with the class_weight technique. Could someone please explain the cause of this difference?
Sarah (601)
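
For context, class_weight reweights the training loss instead of adding synthetic points; a sketch of what sklearn's 'balanced' mode computes on an invented multiclass label vector:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils.class_weight import compute_class_weight

    y = np.array([0] * 90 + [1] * 8 + [2] * 2)  # toy 90:8:2 label vector

    # 'balanced' uses n_samples / (n_classes * class_count), so rarer
    # classes get proportionally larger weight in the loss.
    weights = compute_class_weight(class_weight="balanced",
                                   classes=np.unique(y), y=y)
    print(dict(zip(np.unique(y), weights)))  # approx. {0: 0.37, 1: 4.17, 2: 16.67}

    # Passing class_weight='balanced' to a classifier has the same effect,
    # without the neighbourhood assumptions SMOTE makes when interpolating.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
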
6 votes · 1 answer

SMOTE vs SMOTE-NC for binary classifier with categorical and numeric data

I am using XGBoost for classification. My y is 0 or 1 (true or false). I have categorical and numeric features, so theoretically I need to use SMOTE-NC instead of SMOTE. However, I get better results with SMOTE. Could anyone explain why this is…
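
For reference, SMOTE-NC (imblearn's SMOTENC) takes the positions of the categorical columns, interpolates only the numeric ones, and fills each categorical feature from the nearest neighbours; a sketch on invented mixed-type data:

    import numpy as np
    from imblearn.over_sampling import SMOTENC

    # Toy data: column 0 is numeric, column 1 is a categorical code.
    X = np.array([[1.2, 0], [0.8, 1], [1.1, 0], [3.5, 2], [0.9, 1],
                  [1.0, 0], [3.4, 2], [3.6, 2], [1.3, 1], [3.7, 0]])
    y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

    # categorical_features marks columns SMOTE-NC must NOT interpolate;
    # plain SMOTE would average the codes into meaningless fractions.
    sm = SMOTENC(categorical_features=[1], k_neighbors=2, random_state=0)
    X_res, y_res = sm.fit_resample(X, y)
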
6 votes · 1 answer

How to avoid resampling part of pipeline on test data (imblearn package, SMOTE)

I am using the imblearn package to resample some data before applying other transformation/prediction techniques. Specifically, I am using SMOTE in a slightly unconventional way, as a data augmentation technique to upsample all classes rather than…
asher1213 (91)
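
The standard answer: samplers placed in an imblearn pipeline run during fit only, so scoring or predicting on test data bypasses the resampling step automatically; a sketch:

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    pipe = make_pipeline(SMOTE(random_state=0), LogisticRegression(max_iter=1000))

    pipe.fit(X_tr, y_tr)           # SMOTE resamples the training data here...
    print(pipe.score(X_te, y_te))  # ...but is skipped at predict/score time
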
6 votes · 3 answers

SMOTE and multi-class oversampling

I have read that the SMOTE package is implemented for binary classification. In the case of n classes, it creates additional examples for the smallest class. Can I balance all the classes by running the algorithm n-1 times?
atos (81)
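
With current imbalanced-learn there is no need for n-1 manual runs: SMOTE handles multiclass targets directly, and sampling_strategy controls which classes get resampled; a sketch on a toy three-class problem:

    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Three classes with roughly an 80:15:5 split.
    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                               weights=[0.8, 0.15, 0.05], random_state=0)
    print(Counter(y))

    # 'not majority' oversamples every class except the largest, so all
    # classes end up the same size in a single call.
    sm = SMOTE(sampling_strategy="not majority", random_state=0)
    X_res, y_res = sm.fit_resample(X, y)
    print(Counter(y_res))
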
6 votes · 1 answer

How does SMOTE work for a dataset with only categorical variables?

I have a small dataset of 977 rows with a class proportion of 77:23. For the sake of metrics improvement, I have kept my minority class ('default') as class 1 (and 'not default' as class 0). My input variables are categorical in nature. So, the…
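
When every feature is categorical, imbalanced-learn offers SMOTEN (nominal-only SMOTE), which measures distance with a value-difference metric and assigns each synthetic sample the most common category among its neighbours instead of interpolating; a sketch on invented data:

    import numpy as np
    from imblearn.over_sampling import SMOTEN

    # All-categorical toy data; SMOTEN accepts raw string categories.
    X = np.array([["owner", "urban"], ["renter", "urban"], ["owner", "rural"],
                  ["renter", "rural"], ["owner", "urban"], ["renter", "urban"],
                  ["owner", "rural"], ["renter", "urban"], ["owner", "urban"],
                  ["renter", "rural"]], dtype=object)
    y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # roughly 70:30

    # No fractional category codes are ever produced: new samples take the
    # most frequent value among the minority neighbours, feature by feature.
    X_res, y_res = SMOTEN(k_neighbors=2, random_state=0).fit_resample(X, y)
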
5 votes · 2 answers

SMOTE for multilabel classification

I have a dataset with 77 different labels. Each sample has one or more of these labels. I did some data analysis and found out that the dataset is highly imbalanced - there are a large number of examples that have a particular label, whereas the…
4 votes · 1 answer

Why removing rows with NA values from the majority class improves model performance

I have an imbalanced dataset like so: df['y'].value_counts(normalize=True) * 100 gives No 92.769441, Yes 7.230559. The dataset consists of 13194 rows and 37 features. I have made numerous attempts to improve the…
sums22 (407)
4 votes · 1 answer

SMOTE for regression

I am looking into upsampling an imbalanced dataset for a regression problem (numerical target variable) in Python. I attached a paper and an R package that implement SMOTE for regression; can anyone recommend a similar package in Python? Otherwise, what…
thereandhere1 (715)
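
For the record, imbalanced-learn itself has no SMOTE for regression; third-party Python packages such as smogn implement the SMOGN variant. The core idea can also be sketched by hand: interpolate the target along with the features between "rare" samples. The helper below (smote_regression, with its rare_mask argument) is hypothetical, not a library function:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_regression(X, y, rare_mask, n_synthetic, k=5, seed=0):
        """Hypothetical sketch: SMOTE-style interpolation of features AND
        target between rare samples (e.g. extreme y values) and their
        neighbours. Requires more than k rare samples."""
        rng = np.random.default_rng(seed)
        X_rare, y_rare = X[rare_mask], y[rare_mask]
        # k + 1 because each point is its own nearest neighbour.
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_rare).kneighbors(X_rare)
        X_new, y_new = [], []
        for _ in range(n_synthetic):
            i = rng.integers(len(X_rare))       # random rare seed point
            j = idx[i, rng.integers(1, k + 1)]  # one of its k rare neighbours
            gap = rng.random()                  # interpolation factor in [0, 1)
            X_new.append(X_rare[i] + gap * (X_rare[j] - X_rare[i]))
            y_new.append(y_rare[i] + gap * (y_rare[j] - y_rare[i]))
        return np.vstack([X, np.array(X_new)]), np.concatenate([y, y_new])

Here rare_mask could flag, say, the top decile of the target: rare_mask = y > np.quantile(y, 0.9).
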
4 votes · 1 answer

Combining 'class_weight' with SMOTE

This might sound like a weird question, but I could not find enough details in the sklearn documentation about 'class_weight'. Can we first oversample the dataset using SMOTE and then call the classifier with the 'class_weight' option? As my testing set is…
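
Nothing stops the two from being combined: the classifier's class_weight applies to whatever training data it receives, including SMOTE output. A sketch follows; note that on a fully balanced resample, 'balanced' weights collapse to 1, so the combination mainly matters with partial oversampling:

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Partially oversample: bring the minority class up to half the
    # majority's size rather than all the way to parity.
    X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)

    # class_weight then reweights whatever imbalance remains.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_res, y_res)
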