Improving accuracy on highly imbalanced dataset

Question

I need some suggestions to improve my model accuracy.

The training data shape is : (166573, 14)

All columns are int or float. I dropped the claims_daysaway column because most of its values are NaN, and replaced the NaN values in the tier column with the column mean.

X_train = train.drop(['outcome','testindex','claims_daysaway'], axis=1)
y_train = train['outcome']

As the values were on different scales, I used StandardScaler() to standardize them.

This dataset is highly imbalanced.

train['outcome'].value_counts()

0    159730 
1      6843

I tried SMOTE for oversampling.

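import pandas as pd
from imblearn.over_sampling import SMOTE
smt = SMOTE()
# fit_resample replaces fit_sample, which newer imbalanced-learn releases removed
X_train, y_train = smt.fit_resample(X_train, y_train)
pd.Series(y_train).value_counts()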

1    159730
0    159730

Lastly, I fit a model using XGBClassifier, but when I ran it on the test data and submitted the predictions, it scored only 60% on roc_auc_score.

Please suggest how to handle an imbalanced dataset better.

score 5 · Answer 1 · answered Apr 21 '19 at 19:16

I'm not very sure what you mean by "60% accuracy using AUC". Accuracy and AUC are two different metrics... I'm going to answer as if you're referring to classification accuracy, since that's in your title and the first sentence of your post.

First of all, don't use accuracy to evaluate performance on imbalanced data!

Your dataset has an imbalance ratio of 6843/159730, which is around 1/23. This means that a dummy classifier that always predicts the majority class would get an accuracy of about 96% (159730/166573). There are better options for imbalanced data, such as the F1 score or any macro-averaged metric (see this post for more information).
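To make that arithmetic concrete, here is a small sketch using only the class counts from the question. The f1 helper is written out by hand for illustration and is not a library function; a metrics library should give the same numbers.

```python
# Class counts taken from the value_counts() output in the question.
n_negative = 159730  # outcome == 0
n_positive = 6843    # outcome == 1
n_total = n_negative + n_positive

# Confusion matrix of a dummy classifier that always predicts class 0.
true_negative = n_negative
false_negative = n_positive
true_positive = 0
false_positive = 0

accuracy = (true_positive + true_negative) / n_total


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# F1 with class 1 treated as the positive label.
f1_positive = f1(true_positive, false_positive, false_negative)
# F1 with class 0 treated as the positive label (roles of the cells swap).
f1_negative = f1(true_negative, false_negative, false_positive)
# Macro-averaged F1: unweighted mean over both classes.
macro_f1 = (f1_positive + f1_negative) / 2

print(f"accuracy    = {accuracy:.3f}")
print(f"F1 (class 1) = {f1_positive:.3f}")
print(f"macro F1    = {macro_f1:.3f}")
```

The accuracy comes out near 0.96, while the F1 score for the minority class is 0 and the macro F1 is about 0.49, which is a much more honest picture of a model that never finds a single positive case.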

Secondly, I'm not sure whether you are doing this, but just in case: you shouldn't evaluate on the oversampled dataset.
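A minimal sketch of what I mean, assuming you hold out a validation set yourself: split first, then oversample only the training part. To keep it dependency-free I use plain random oversampling (duplicating minority rows) instead of SMOTE, and toy labels instead of your data. The function names stratified_split and random_oversample are made up for this example.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = int(round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    return balanced


# Toy labels with the same 1:23 imbalance as in the question (made-up data).
labels = [1] * 40 + [0] * 920

# 1) Split BEFORE resampling, so the held-out rows are never duplicated.
train_idx, test_idx = stratified_split(labels)
# 2) Oversample the training indices only.
train_balanced = random_oversample(train_idx, labels)

train_counts = Counter(labels[i] for i in train_balanced)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

The training part ends up balanced, while the held-out part keeps the original class ratio and shares no rows with the training part, so the score you measure on it reflects the real distribution.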

As for ideas for improving performance, I don't have many because you are doing most things right. Tree-based algorithms (e.g. XGBoost) are good for dealing with imbalanced data. You are already oversampling the data, which helps a lot. Some other ideas are:

Try different oversamplers, undersamplers or perhaps a combination of over and under-sampling techniques.
Run a hyperparameter search for your XGBoost model. I can't tell from the information you gave, but maybe you're overfitting.
Try different algorithms (catboost, lightgbm, etc.), or maybe ensembles of those models (stacked models, etc.).
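As an illustration of the first idea, combining over- and under-sampling, here is a rough sketch in plain Python. It shrinks the majority class and grows the minority class toward a common size. The function combine_sampling, the target size of 200 and the toy rows are all assumptions made up for this example; a dedicated resampling library offers more refined variants.

```python
import random


def combine_sampling(majority, minority, target_per_class, seed=0):
    """Undersample the majority and oversample the minority toward one size."""
    rng = random.Random(seed)
    # Undersample: keep a random subset of majority rows, without replacement.
    kept_majority = rng.sample(majority, min(target_per_class, len(majority)))
    # Oversample: keep every minority row and add random duplicates if needed.
    n_extra = max(target_per_class - len(minority), 0)
    grown_minority = list(minority) + rng.choices(minority, k=n_extra)
    return kept_majority, grown_minority


# Toy rows with a 1:23 imbalance (made-up data, not the asker's dataset).
majority_rows = [("neg", i) for i in range(920)]
minority_rows = [("pos", i) for i in range(40)]

kept, grown = combine_sampling(majority_rows, minority_rows, target_per_class=200)
print(len(kept), len(grown))
```

Meeting in the middle like this discards fewer majority rows than pure undersampling and creates fewer duplicates than pure oversampling. As with any resampling, it should be applied to the training data only.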
