
I am currently working on the Titanic dataset from Kaggle. The dataset is imbalanced, with roughly 61.5% negative and 38.5% positive class.

I divided my training data into an 85% train and 15% validation split and chose a support vector classifier as the model. I ran 10-fold stratified cross-validation on the training set and, for each fold, searched for the threshold that maximizes the F1 score. Averaging the thresholds obtained on the validation folds gives a mean of 35% +/- 10%.
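For reference, a minimal sketch of the per-fold threshold search described above (the X_train/y_train names and the SVC settings are placeholders, not my exact code):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve

# X_train, y_train: numpy arrays holding the 85% training split (placeholders)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_thresholds = []

for train_idx, val_idx in skf.split(X_train, y_train):
    clf = SVC(probability=True)
    clf.fit(X_train[train_idx], y_train[train_idx])
    probs = clf.predict_proba(X_train[val_idx])[:, 1]

    # threshold that maximises F1 on this fold's held-out part
    precision, recall, thresholds = precision_recall_curve(y_train[val_idx], probs)
    f1 = (2 * precision * recall) / (precision + recall + 1e-12)  # epsilon guards against 0/0
    fold_thresholds.append(thresholds[np.argmax(f1[:-1])])

print('mean threshold: %.2f +/- %.2f' % (np.mean(fold_thresholds), np.std(fold_thresholds)))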

After that, I tested the model on the validation set and estimated the threshold that maximizes the F1 score there. The threshold for the validation set is about 63%, which is very far from the threshold obtained during cross-validation.

I tested the model on the holdout test set from Kaggle and I am unable to get a good score with either threshold (35% from cross-validation on the training set, or 63% from the validation set).


How does one determine the optimal threshold from the available data so that it also works well on unseen data? Do I choose the threshold obtained from cross-validation or the one from the validation set? Or am I doing this completely wrong? I would appreciate any help and advice regarding this.

For this dataset, I am looking to maximize my score on the leaderboard by getting the highest accuracy.

Thank you.

  • Highest accuracy or $F_1$? – Dave Jun 16 '21 at 09:50
  • Although I originally wanted to get the highest F1 score, for this Kaggle competition the metric used for scoring is accuracy. But I would like to know how to optimize the threshold to get the highest F1 score too. – Joe Jun 16 '21 at 10:18

1 Answer


In short, you should be the judge of that: it depends on how much precision (minimising "false alarms"/FP) and how much recall (minimising "missed positives"/FN) you want your classifier to have.

The appropriate way to look at precision-recall pairs across different thresholds is a precision-recall curve (PRC), especially if you want to focus on the minority class. Via a PRC you can find the optimal threshold, as far as model performance goes, as a function of precision and recall.

I copy below a pseudo-snippet:

import numpy as np
from sklearn.metrics import precision_recall_curve

# fit the model and keep the predicted probabilities of the positive class
model.fit(trainX, trainy)
probs = model.predict_proba(testX)[:, 1]

# calculate the precision-recall curve on the test labels
precision, recall, thresholds = precision_recall_curve(testy, probs)

# convert each precision/recall pair to an F1 score
fscore = (2 * precision * recall) / (precision + recall)

# locate the index of the largest F1 score
ix = np.argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))

sauce for code

The PRC would look like this: [precision-recall curve plot]

You can alternatively follow the equivalent approach for ROC curves.
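For illustration, a minimal sketch of the ROC-based equivalent, picking the threshold that maximises Youden's J statistic (TPR - FPR); it assumes the same testy and probs arrays as in the snippet above:

import numpy as np
from sklearn.metrics import roc_curve

# same testy / probs as in the PRC snippet
fpr, tpr, thresholds = roc_curve(testy, probs)

# Youden's J statistic: pick the threshold with the largest TPR - FPR
J = tpr - fpr
ix = np.argmax(J)
print('Best Threshold=%f, J=%.3f' % (thresholds[ix], J[ix]))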

  • Thank you for your reply. But let me ask another question: would the ideal threshold from the precision-recall curve of the validation set (i.e., when the data is split into train and validation sets) be the ideal threshold on unseen data too? Or should I also cross-validate the training set using stratified folds and obtain the corresponding thresholds from the precision-recall curves of each fold? – Joe Jun 16 '21 at 16:06
  • You can calculate a PRC and the respective best threshold on your test set. But if your question is "how do I get the best performance in relation to precision-recall", you should use either F1 or average precision for scoring during hyperparameter optimisation. – hH1sG0n3 Jun 17 '21 at 10:59
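To make the last comment concrete, a minimal sketch of scoring a hyperparameter search with F1 (the parameter grid and the X_train/y_train names are illustrative, not from the thread):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# illustrative grid; X_train, y_train as in the question
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 0.01]}
search = GridSearchCV(SVC(probability=True), param_grid,
                      scoring='f1',   # or scoring='average_precision'
                      cv=10)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)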