Questions tagged [imbalanced-learn]

imbalanced-learn (imblearn) is a Python package for dealing with imbalanced data in machine learning contexts. It provides a range of over-sampling and under-sampling methods for data sets, including the popular SMOTE technique for over-sampling. The package is fully compatible with scikit-learn.

50 questions
11
votes
3 answers

For imbalanced classification, should the validation dataset be balanced?

I am building a binary classification model for imbalanced data (e.g., 90% Pos class vs 10% Neg Class). I already balanced my training dataset to reflect a 50/50 class split, while my holdout (training dataset) was kept similar to the original…
thereandhere1
6
votes
1 answer

Difference between sklearn make_pipeline and imblearn make_pipeline

Can anybody please explain the difference between sklearn.pipeline.make_pipeline and imblearn.pipeline.make_pipeline?
6
votes
2 answers

Can we specify the number of samples generated (minority class) using SMOTE?

I am trying to improve classification of an imbalanced credit card fraud dataset using SMOTE from imbalanced-learn. However, it generates data up to a 50% split; can we specify the number of samples to be generated? I want to track the classifier…
4
votes
1 answer

Why is oversampling outperforming class weight?

I have a dataset that is highly imbalanced. One class has 412 samples (class 0) while the other has 67215 samples (class 1). For its classification, I am using an MLP. When I use a class weight of 165 for class 0 and 1 for class 1 (in Keras), I am…
4
votes
1 answer

SMOTE for regression

I am looking into upsampling an imbalanced dataset for a regression problem (numerical target variables) in Python. I attached a paper and an R package that implement SMOTE for regression; can anyone recommend a similar package in Python? Otherwise, what…
thereandhere1
4
votes
3 answers

How to Split And Resample Imbalanced Dataset Into Train, Validation and Test

I want to understand how to split the imbalanced data set with a binary target variable where 87% of the samples are negative and 13% of the samples are positive. Now, I know that you should always split the data into train and test set before doing…
4
votes
1 answer

Combining 'class_weight' with SMOTE

This might sound like a weird question, but I could not find enough details in the sklearn documentation about 'class_weight'. Can we first oversample the dataset using SMOTE and then call the classifier with the 'class_weight' option? As my testing set is…
4
votes
3 answers

Reproducible examples where balancing the training data demonstrably improves accuracy

I asked this question on the Statistics SE, but there were no answers, even when a modest bonus was available, so I am asking here to see if any examples can be given. I have been looking into the imbalanced learning problem, where a classifier is…
Dikran Marsupial
3
votes
1 answer

What does IBA mean in imblearn classification report?

imblearn is a Python library for handling imbalanced data. Code for generating a classification report is given below. import numpy as np from imblearn.metrics import classification_report_imbalanced y_true = [0, 1, 2, 2, 2] y_pred = [0, 0, 2, 2, 1]…
codeczar
3
votes
1 answer

The most informative curve for imbalanced datasets

For imbalanced datasets, can we say the Precision-Recall curve is more informative, and thus more accurate, than the ROC curve? Can we rely on the F1-score to evaluate the skill of the resulting model in this case?
3
votes
0 answers

Balancing the dataset using imblearn undersampling, oversampling and combine?

I have the imbalanced dataset: data['Class'].value_counts() Out[22]: 0 137757 1 4905 Name: Class, dtype: int64 X_train, X_valid, y_train, y_valid = train_test_split(input_x, input_y, test_size=0.20,…
hanzgs
3
votes
3 answers

Imbalanced dataset in text classification

I have a data set collected from Facebook that consists of 10 classes, each with 2500 posts, but when I count the number of unique words in each class, the counts differ, as shown in the figure. Is this an imbalance problem due to word count, or…
mtesta010
3
votes
1 answer

How to use SMOTENC inside the Pipeline?

I would greatly appreciate it if you could let me know how to use SMOTENC. I wrote: num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values) cat_indices1 =…
ebrahimi
2
votes
1 answer

Using SMOTENC in a pipeline

I am trying to figure out the appropriate way to build a pipeline to train a model which includes using the SMOTENC algorithm: Given that the k-nearest neighbors algorithm and Euclidean distance are used, should the data be normalized (Scale input…
thereandhere1
2
votes
1 answer

Positively skewed target label in regression

I have a dataset where the target label is positively skewed and produces a long tail, and currently I get high residuals on these values when experimenting with some linear, tree-based and neural-network regression models. I see the same problem…