Questions tagged [oversampling]
15 questions
4
votes
3 answers
Timing of applying random oversampling on the dataset
I am trying to learn classification with machine learning algorithms. I went through the notebook "Breast Cancer - EDA, Balancing and ML". In this notebook, random oversampling was implemented. However, when the author did the oversampling, he did it…
Encipher
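A minimal sketch of the ordering this question is about, assuming the usual recommendation to split first and oversample only the training portion. The toy data, the `random_oversample` helper, and the 80/20 split are all hypothetical and not taken from the notebook.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 90 majority rows (label 0) and 10 minority rows (label 1).
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10

# 1) Split first, so the test set keeps the original class distribution.
indices = list(range(len(rows)))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:20], indices[20:]
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
test_rows = [rows[i] for i in test_idx]
test_labels = [labels[i] for i in test_idx]

# 2) Oversample the training portion only; the test rows are never duplicated.
bal_rows, bal_labels = random_oversample(train_rows, train_labels)

print("train before:", Counter(train_labels))
print("train after: ", Counter(bal_labels))
print("test (untouched):", Counter(test_labels))
```

Oversampling before the split would let copies of the same minority row land in both the training and the test set, which tends to inflate test scores.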
2
votes
3 answers
Unbalanced data in the train set and test set
I already have 2 datasets: one to use for training and one for testing.
Both datasets are unbalanced (with similar percentages), with around 90% of label 1.
Will it be useful to balance the data if the test set is very unbalanced anyway?
Instances…
mikeman
1
vote
2 answers
How to properly use oversampling without inflating results?
I am working with a tiny private dataset (over 192 samples) with 4 classes. A preprocessing step is trivial in order to do any classification. Among feature selection and extraction techniques, I decided to apply oversampling (SMOTE). Here is what I…
heresthebuzz
1
vote
1 answer
Should synthetic data be oversampled as well?
I'm building a binary text classifier, the ratio between the positives and negatives is 1:100 (100 / 10000).
By using back translation as an augmentation, I was able to get 400 more positives.
Then I decided to do upsampling to balance the data. Do…
guestmember123456790
1
vote
0 answers
Oversampling multivariate time series data
For a classification task, I have multivariate time series data composed of 4 satellite images in the form (145521 pixels, 4 dates, 2 bands). I used tempCNN to classify the data into 5 classes. However, there is a big gap…
ala
0
votes
2 answers
How to increase the accuracy after oversampling?
The accuracy before oversampling:
On training: 98.54%
On testing: 98.21%
The accuracy after oversampling:
On training: 77.92%
On testing: 90.44%
What does this mean, and how can I increase the accuracy?
Edit:
Classes before…
Mimi
0
votes
0 answers
How to use SMOTE to rebalance multiclass dataset when the target is one hot encoded with pd.get_dummies?
I'm using a multiclass dataset (cic-ids-2017), which is very imbalanced. I have already encoded the categorical feature (which is the target) using OneHotEncoder.
I tried to use the SMOTE oversampling method to balance the data with a pipeline:
X =…
Mimi
0
votes
2 answers
Is it good practice to use SMOTE when you have a data set that has imbalanced classes when using BERT model for text classification?
I have a question related to SMOTE. If you have a data set that is imbalanced, is it correct to use SMOTE when you are using BERT? I believe I read somewhere that you do not need to do this since BERT takes this into account, but I'm unable to find…
QMan5
0
votes
0 answers
LSTM, seq to classification: why does training on a balanced data set yield such a good result?
I am using LSTM to classify the origin of people's names.
The input data is not balanced over target classes, so I used oversampling to balance it.
Now, I defined a simple LSTM model as follows:
LSTM_with_Embedding(
(embedding): Embedding(32, 10,…
0
votes
1 answer
SMOTENC oversampling without one-hot encoding
I'm using SMOTENC to oversample an imbalanced dataset.
I thought the point of SMOTENC was to give the option to oversample categorical features without one-hot encoding them. The reason I don't want to one-hot encode is to avoid Curse of…
Boots
0
votes
1 answer
A question about overfitting and SMOTE
So I understand that overfitting is when you have, for example, good accuracy on the training dataset and bad accuracy on the testing dataset, but why would I even check the accuracy on the training dataset?
If I have a good accuracy on the testing…
FjkgB
0
votes
0 answers
Prior probability shift vs oversampling/undersampling imbalanced datasets
I'm trying to understand what prior probability shift (label drift) in data means.
If I understand it correctly, it means that the distribution of labels in the training dataset differs from the distribution of labels in the production environment. This…
user60175
0
votes
0 answers
Oversampling SMOTE sampling strategy ratio
I have 36168 rows with an imbalanced target. About 88.4% is 0 (31970 rows) and 11.6% is 1 (4198 rows).
I want to apply oversampling using SMOTE. Is it ideal to make the amounts equal, so that the 0 and 1 targets each contain 31970 rows? Because I think in the…
Jovian Aditya
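The arithmetic behind this question can be sketched as follows, using the class counts from the excerpt. In imbalanced-learn, a float `sampling_strategy` for a binary problem is the desired minority-to-majority ratio after resampling; the helper `synthetic_needed` and the candidate ratios below are hypothetical, and the exact rounding the library uses is not assumed here.

```python
def synthetic_needed(n_majority, n_minority, ratio):
    """Number of new minority samples needed so that minority / majority reaches `ratio`."""
    target_minority = int(n_majority * ratio)
    return max(0, target_minority - n_minority)


# Class counts taken from the question excerpt.
n_majority, n_minority = 31970, 4198

# Hypothetical candidate ratios: full balance (1.0) and two partial options.
for ratio in (0.25, 0.5, 1.0):
    extra = synthetic_needed(n_majority, n_minority, ratio)
    share = extra / (n_minority + extra)
    print(f"ratio={ratio}: {extra} synthetic rows, "
          f"{share:.0%} of the minority class would be synthetic")
```

At a ratio of 1.0 most of the minority class would be synthetic, which is one reason a partial ratio is sometimes tried and compared under cross-validation.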
0
votes
1 answer
Explaining the logic behind the pipe_line method for cross-validation of imbalanced datasets
Reading the following article: https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html
There is an explanation of how to use from imblearn.pipeline import make_pipeline in order to perform a cross-validation on an…
PwNzDust
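A rough sketch of the logic such a pipeline encodes, written without imbalanced-learn so it stays self-contained: resampling happens inside each fold, on the training indices only, and the validation fold is left as it is. The helpers `kfold_indices` and `oversample_indices` and the toy labels are hypothetical; this illustrates the idea, not the library's implementation.

```python
import random


def kfold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def oversample_indices(train_idx, labels, rng):
    """Repeat minority-class indices at random until all classes are the same size."""
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(group) for group in by_class.values())
    resampled = []
    for group in by_class.values():
        resampled.extend(group)
        resampled.extend(rng.choice(group) for _ in range(target - len(group)))
    return resampled


# Hypothetical imbalanced labels: 90 of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10
rng = random.Random(1)

fold_reports = []
for train_idx, val_idx in kfold_indices(len(labels), k=5):
    # Resample inside the fold, using training indices only.
    resampled = oversample_indices(train_idx, labels, rng)
    leaked = set(resampled) & set(val_idx)
    fold_reports.append((len(train_idx), len(resampled), len(leaked)))
    # A model would be fit on `resampled` and scored on the untouched `val_idx`.

for n_train, n_resampled, n_leaked in fold_reports:
    print(f"train={n_train} resampled={n_resampled} leaked_into_validation={n_leaked}")
```

Placing the sampler as a pipeline step is meant to achieve the same thing: the sampler runs when the pipeline is fitted on each training fold and is skipped when predicting on the validation fold.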
-1
votes
1 answer
Is my model classification overfitting?
Could this be just a bad draw on the 20%, or is it overfitting? I'd appreciate some tips on what's going on.
user135670