I have an imbalanced dataset and I want to train a binary classifier on it.
Here is the approach I used, which gave (relatively) acceptable performance:
1- I made a random split to get train/test sets.
2- In the training set, I down-sampled the majority class to make the training set balanced. To do that, I used the resample method from the sklearn.utils module (see the sketch after this list).
3- I trained the model and then evaluated its performance on the test set (which is unseen and still imbalanced).
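A minimal sketch of those three steps, using synthetic data in place of my real dataset (the sizes, class ratio, and variable names are just placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for my real dataset
X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=42)

# Step 1: random train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: down-sample the majority class (label 0) in the training set only
X_maj, y_maj = X_train[y_train == 0], y_train[y_train == 0]
X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
X_maj_down, y_maj_down = resample(
    X_maj, y_maj, replace=False, n_samples=len(y_min), random_state=42
)
X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.concatenate([y_maj_down, y_min])

# Step 3: train on the balanced set, evaluate on the (still imbalanced) test set
clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print(classification_report(y_test, clf.predict(X_test)))
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```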
I got fairly acceptable results, including precision, recall, F1 score, and AUC.
Afterwards, I wanted to try something out, so I flipped the labels in both the training and test sets (i.e. converting 1 to 0 and 0 to 1).
Then I repeated step 3 and trained the model again on the flipped labels. This time, the model's performance dropped: I got much lower precision and F1 score on the test set.
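The flip itself was just an inversion of the binary labels; continuing the sketch above:

```python
# Flip the labels in both sets (1 -> 0, 0 -> 1), then repeat step 3
y_bal_flipped = 1 - y_bal
y_test_flipped = 1 - y_test

clf_flipped = LogisticRegression(max_iter=1000).fit(X_bal, y_bal_flipped)
print(classification_report(y_test_flipped, clf_flipped.predict(X_test)))
```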
Additional details:
The model was trained with GridSearchCV using a LogisticRegression estimator.
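Roughly, the grid search setup looked like the following, again continuing the sketch above (the parameter grid shown is only illustrative, not my exact grid):

```python
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; my actual grid differed
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_bal, y_bal)
print(search.best_params_)
```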
I then have two questions: is there anything wrong with my approach (i.e. the down-sampling)? And why did flipping the labels lead to worse results? I have a feeling it could be because my test set is still imbalanced, but more insight would be appreciated.