I am working on a highly imbalanced binary-labeled dataset, where positive labels make up just 7% of the whole dataset. However, some combinations of features can yield a higher-than-average share of ones in a subset.
E.g. we have the following dataset with a single feature (color):
180 red samples — 0
20 red samples — 1
300 green samples — 0
100 green samples — 1
We can build a simple decision tree:
              (color)
         red /      \ green
P(1 | red) = 0.1   P(1 | green) = 0.25
P(1) = 0.2 for the overall dataset
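These numbers are easy to verify directly from the counts; a quick check in Python:

red_0, red_1 = 180, 20       # red samples labeled 0 / 1
green_0, green_1 = 300, 100  # green samples labeled 0 / 1

print(red_1 / (red_0 + red_1))        # P(1 | red)   = 0.1
print(green_1 / (green_0 + green_1))  # P(1 | green) = 0.25
print((red_1 + green_1) / (red_0 + red_1 + green_0 + green_1))  # P(1) = 0.2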
If I run XGBoost on this dataset, it can predict probabilities no larger than 0.25. This means that if I make a decision at the 0.5 threshold:
- 0 if P < 0.5
- 1 if P >= 0.5
then all samples will be labeled as zeroes. I hope I have described the problem clearly.
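In NumPy terms, that mapping is just a comparison against the threshold (a minimal sketch with a hypothetical proba array):

import numpy as np

proba = np.array([0.1, 0.1, 0.25, 0.25])  # hypothetical predicted probabilities
labels = (proba >= 0.5).astype(int)       # 0 if P < 0.5, 1 if P >= 0.5
print(labels)                             # [0 0 0 0], every sample labeled 0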
Now, on the initial dataset I am getting the following plot of f1_score against the decision threshold (threshold on the x-axis), with the maximum f1_score at threshold = 0.1.
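A plot like this can be reproduced along the following lines (a sketch; y_true and proba are stand-ins for the true labels and predicted probabilities, here filled in with the red/green toy data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

# stand-ins: replace with your own labels and predicted probabilities
y_true = np.concatenate((np.zeros(180), np.ones(20), np.zeros(300), np.ones(100)))
proba = np.concatenate((np.full(200, 0.1), np.full(400, 0.25)))

thresholds = np.linspace(0.01, 0.99, 99)
# zero_division=0 avoids warnings when no positives are predicted
scores = [f1_score(y_true, (proba >= t).astype(int), zero_division=0)
          for t in thresholds]

plt.plot(thresholds, scores)
plt.xlabel('threshold')
plt.ylabel('f1_score')
plt.show()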
Now I have two questions:
- should I even use f1_score for a dataset with such a structure?
- is it always reasonable to use a 0.5 threshold for mapping probabilities to labels when using XGBoost for binary classification?
Update. I see that this topic is drawing some interest. Below is the Python code to reproduce the red/green experiment with XGBoost. It outputs the expected probabilities:
import numpy as np
import xgboost as xgb

# encode color as a single numeric feature: 0 = red, 1 = green
X0_0 = np.zeros(180)  # red - 0
Y0_0 = np.zeros(180)
X0_1 = np.zeros(20)   # red - 1
Y0_1 = np.ones(20)
X1_0 = np.ones(300)   # green - 0
Y1_0 = np.zeros(300)
X1_1 = np.ones(100)   # green - 1
Y1_1 = np.ones(100)

X = np.concatenate((X0_0, X0_1, X1_0, X1_1))
Y = np.concatenate((Y0_0, Y0_1, Y1_0, Y1_1))

# reshaping into 2-dim array
X = X.reshape(-1, 1)

# train on the full dataset (no train/test split is needed here)
xgb_dmat = xgb.DMatrix(X, label=Y)

param = {'max_depth': 1,
         'eta': 0.01,
         'objective': 'binary:logistic',
         'eval_metric': 'error',
         'nthread': 4}

model = xgb.train(param, xgb_dmat, 400)

# predict for a single red sample and a single green sample
X0_sample = np.array([[0]])
X1_sample = np.array([[1]])
print('P(1 | red), predicted: ' + str(model.predict(xgb.DMatrix(X0_sample))))
print('P(1 | green), predicted: ' + str(model.predict(xgb.DMatrix(X1_sample))))
Output:
P(1 | red), predicted: [ 0.1073855]
P(1 | green), predicted: [ 0.24398108]
