
I am currently working on the Titanic dataset from Kaggle. The dataset is imbalanced, with roughly 61.5% negative and 38.5% positive class.

I divided my training data into an 85% train and 15% validation split and chose a support vector classifier as the model. I ran 10-fold stratified cross-validation on the training set and, for each fold, searched for the threshold that maximizes the F1 score. Averaging the thresholds obtained on the validation folds gives a mean of 35% +/- 10%.
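For reference, a minimal sketch of the per-fold threshold search described above (the X_train/y_train names and the SVC settings are placeholders, not my exact code):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve

# X_train, y_train: numpy arrays holding the 85% training split (placeholders)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_thresholds = []

for train_idx, val_idx in skf.split(X_train, y_train):
    clf = SVC(probability=True)
    clf.fit(X_train[train_idx], y_train[train_idx])
    probs = clf.predict_proba(X_train[val_idx])[:, 1]

    # threshold that maximises F1 on this fold's held-out part
    precision, recall, thresholds = precision_recall_curve(y_train[val_idx], probs)
    f1 = (2 * precision * recall) / (precision + recall + 1e-12)  # epsilon guards against 0/0
    fold_thresholds.append(thresholds[np.argmax(f1[:-1])])

print('mean threshold: %.2f +/- %.2f' % (np.mean(fold_thresholds), np.std(fold_thresholds)))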

After that, I tested the model on the validation set and estimated the threshold that maximizes the F1 score there. The threshold for the validation set is about 63%, which is very far from the threshold obtained during cross-validation.

I tested the model on the holdout test set from Kaggle and I am unable to get a good score with either threshold (35% from cross-validation on the training set, or 63% from the validation set).


How does one determine the optimal threshold from the available data so that it also works well on unseen data? Do I choose the threshold obtained from cross-validation or the one from the validation set? Or am I doing this completely wrong? I would appreciate any help and advice regarding this.

For this dataset, I am looking to maximize my score on the leaderboard by getting the highest accuracy.

Thank you.

  • Highest accuracy or $F_1$? – Dave Jun 16 '21 at 09:50
  • Although I originally wanted to get the highest F1 score, for this Kaggle competition the metric used for scoring is accuracy. But I would like to know how to optimize the threshold to get the highest F1 score too. – Joe Jun 16 '21 at 10:18

1 Answer


In short, you should be the judge of that: it depends on how much precision (minimising "false alarms"/FP) and how much recall (minimising "missed positives"/FN) you want your classifier to have.

The appropriate way to look at precision-recall pairs across different thresholds is a precision-recall curve (PRC), especially if you want to focus on the minority class. Via a PRC you can find the optimal threshold, as far as model performance goes, as a function of precision and recall.

I copy below a pseudo-snippet:

import numpy as np
from sklearn.metrics import precision_recall_curve

# fit the model and keep the predicted probabilities of the positive class
model.fit(trainX, trainy)
probs = model.predict_proba(testX)[:, 1]

# calculate the precision-recall curve on the test labels
precision, recall, thresholds = precision_recall_curve(testy, probs)

# convert each precision/recall pair to an F1 score
fscore = (2 * precision * recall) / (precision + recall)

# locate the index of the largest F1 score
ix = np.argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))

sauce for code

The PRC would look like this: [precision-recall curve plot]

You can alternatively follow the equivalent approach for ROC curves.
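For illustration, a minimal sketch of the ROC-based equivalent, picking the threshold that maximises Youden's J statistic (TPR - FPR); it assumes the same testy and probs arrays as in the snippet above:

import numpy as np
from sklearn.metrics import roc_curve

# same testy / probs as in the PRC snippet
fpr, tpr, thresholds = roc_curve(testy, probs)

# Youden's J statistic: pick the threshold with the largest TPR - FPR
J = tpr - fpr
ix = np.argmax(J)
print('Best Threshold=%f, J=%.3f' % (thresholds[ix], J[ix]))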

  • Thank you for your reply. But let me ask another question: would the ideal threshold from the precision-recall curve of the validation set (i.e., when the data is split into train and validation sets) be the ideal threshold on unseen data too? Or should I also cross-validate the training set using stratified folds and obtain the corresponding thresholds from the precision-recall curves of each fold? – Joe Jun 16 '21 at 16:06
  • You can calculate a PRC and the respective best threshold on your test set. But if your question is "how do I get the best performance in relation to precision-recall", you should use either F1 or average precision for scoring during hyperparameter optimisation. – hH1sG0n3 Jun 17 '21 at 10:59
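To make the last comment concrete, a minimal sketch of scoring a hyperparameter search with F1 (the parameter grid and the X_train/y_train names are illustrative, not from the thread):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# illustrative grid; X_train, y_train as in the question
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 0.01]}
search = GridSearchCV(SVC(probability=True), param_grid,
                      scoring='f1',   # or scoring='average_precision'
                      cv=10)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)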