My training data is extremely class-imbalanced ({0: 872525, 1: 3335}) with 100 features. I use XGBoost to build a classification model, with Bayesian optimisation to tune the hyperparameters over the ranges
{learning_rate: (0.001, 0.1), min_split_loss: (0, 10), max_depth: (3, 70), min_child_weight: (1, 20), max_delta_step: (1, 20), subsample: (0, 1), colsample_bytree: (0.5, 1), lambda: (0, 10), alpha: (0, 10), scale_pos_weight: (1, 262), n_estimators: (1, 20)}.
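
For reference, here is a minimal sketch of the tuning setup described above. It assumes the `bayes_opt` package and uses a small synthetic dataset as a stand-in for my data; the data variables and search settings are placeholders, not my actual code:

```python
# Minimal sketch of the Bayesian-optimisation setup (assumes bayes_opt).
# X, y and the synthetic dataset below are placeholders for my real data.
import xgboost as xgb
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: ~0.4% positives, 100 features.
X, y = make_classification(n_samples=20000, n_features=100,
                           weights=[0.996], random_state=42)

def xgb_cv(learning_rate, min_split_loss, max_depth, min_child_weight,
           max_delta_step, subsample, colsample_bytree, reg_lambda,
           reg_alpha, scale_pos_weight, n_estimators):
    model = xgb.XGBClassifier(
        objective="binary:logistic",
        booster="gbtree",
        learning_rate=learning_rate,
        gamma=min_split_loss,            # gamma is XGBoost's name for min_split_loss
        max_depth=int(max_depth),        # tree depth must be an integer
        min_child_weight=min_child_weight,
        max_delta_step=int(max_delta_step),
        subsample=max(subsample, 0.01),  # subsample must be > 0
        colsample_bytree=colsample_bytree,
        reg_lambda=reg_lambda,
        reg_alpha=reg_alpha,
        scale_pos_weight=scale_pos_weight,
        n_estimators=int(n_estimators),
    )
    # Mean cross-validated ROC AUC is the quantity being maximised.
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

pbounds = {
    "learning_rate": (0.001, 0.1),
    "min_split_loss": (0, 10),
    "max_depth": (3, 70),
    "min_child_weight": (1, 20),
    "max_delta_step": (1, 20),
    "subsample": (0, 1),
    "colsample_bytree": (0.5, 1),
    "reg_lambda": (0, 10),
    "reg_alpha": (0, 10),
    "scale_pos_weight": (1, 262),
    "n_estimators": (1, 20),
}

optimizer = BayesianOptimization(f=xgb_cv, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=10, n_iter=50)
```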
I also use binary:logistic as the objective and ROC AUC as the metric, with booster gbtree. The cross-validation score is 82.5%. However, when I applied the model to the test data, I got only ROC AUC: 75.2%, PR AUC: 15%, log loss: 0.046, and the confusion matrix [[19300, 7], [103, 14]]. I need help finding the best way to increase the true positives, tolerating false positives up to 3 times the number of actual positives.
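
To make the constraint concrete, this is roughly how I imagine choosing the decision threshold; `pick_threshold` is my own illustration, not an existing API, and the threshold should be chosen on a validation set, not the test set:

```python
# Hypothetical sketch: lower the decision threshold from the default 0.5
# to trade false positives for true positives, subject to
# FP <= 3 * (number of actual positives).
import numpy as np

def pick_threshold(y_true, proba, fp_budget_ratio=3.0):
    """Return the lowest threshold whose false-positive count stays
    within fp_budget_ratio * (number of actual positives)."""
    n_pos = int(np.sum(y_true == 1))
    fp_budget = fp_budget_ratio * n_pos
    best = 0.5
    for t in np.linspace(0.5, 0.0, 501):       # sweep downward from 0.5
        pred = (proba >= t).astype(int)
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        if fp <= fp_budget:
            best = t                            # keep lowering while FP budget holds
        else:
            break                               # FP only grows as t decreases
    return best

# Usage (on a held-out validation set, then applied to test):
#   proba = model.predict_proba(X_val)[:, 1]
#   t = pick_threshold(y_val, proba)
#   y_pred = (model.predict_proba(X_test)[:, 1] >= t).astype(int)
```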