
Probabilistic classifiers look really attractive because they give you more information than deterministic ones: estimated probabilities of class membership, rather than just which class the model thinks a given datum belongs to.

So in what circumstances would you choose a deterministic classifier rather than a probabilistic one?

Abijah
    Of possible interest: https://stats.stackexchange.com/q/494023/247274 – Dave Apr 01 '21 at 21:58
    (+1) but see the counter example, where proper scoring rules select the wrong model, here https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/538524#538524 – Dikran Marsupial Aug 16 '21 at 14:45

1 Answer


It is worth considering using a "deterministic" classifier if all of the following conditions are met:

  1. the false-positive and false-negative misclassification costs are known beforehand and are either fixed, or you don't mind retraining your model when they change;
  2. the relative class frequencies in operation are known beforehand and are either fixed, or you don't mind retraining your model when they change;
  3. you don't need a "reject" option (although there are ways around that).

In those situations, you might want to use a classifier like the Support Vector Machine that focuses on solving the classification problem directly. The reason for this is that a probabilistic classifier tries to predict the probability accurately everywhere, and will expend modelling resources doing so. A discrete/deterministic classifier, on the other hand, only focuses resources on estimating the position of one particular contour of probability, as that gives the optimal decision boundary, so in principle it can make better use of the available data.

The nice thing about probabilistic classifiers is that you can adjust for changes in misclassification costs, or changes in relative class frequencies, or implement a "reject" option, easily and without having to retrain the model. The downside is that they make slightly less efficient use of the data, as they consider features of the data distribution that are not relevant to the optimal classification; see my example of that here: https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/538524#538524 . So if you don't need/want any of those nice properties of probabilistic classifiers, you might get better results using a discrete classifier (and the success of the SVM gives evidence that this is true in a variety of practical applications).
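To illustrate the point about adjusting for cost changes without retraining: with a probabilistic classifier, new misclassification costs just move the decision threshold applied to the model's output. A minimal sketch, assuming a binary problem (the function names and cost values are illustrative, not from any particular library):

```python
# If classifying x as positive when it is negative costs C_fp, and the
# reverse costs C_fn, the expected-cost-minimising rule is to predict
# "positive" when P(pos|x) * C_fn > (1 - P(pos|x)) * C_fp,
# i.e. when P(pos|x) > C_fp / (C_fp + C_fn). No retraining needed.

def decision_threshold(cost_fp, cost_fn):
    """Optimal probability threshold for the given misclassification costs."""
    return cost_fp / (cost_fp + cost_fn)

def classify(p_positive, cost_fp=1.0, cost_fn=1.0):
    """Turn an estimated class probability into a hard decision."""
    return "positive" if p_positive > decision_threshold(cost_fp, cost_fn) else "negative"

# Equal costs: the usual 0.5 threshold.
print(classify(0.4))                         # negative
# False negatives 9x as costly (e.g. a missed diagnosis): threshold drops to 0.1.
print(classify(0.4, cost_fp=1, cost_fn=9))   # positive
```

A deterministic classifier trained with equal costs would have to be retrained to reproduce the second decision.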

In short, have both sets of tools in your data science toolbox as both are useful.

Dikran Marsupial
  • Could you explain that in more detail? and explain whether you need all of i-iii or just one of i-iii and why if you have i-iii that means that deterministic modelling is better – Abijah Aug 17 '21 at 16:33
  • It basically requires all of i-iii (I've edited the answer to clarify that). The example I linked to is as good an explanation of why a discrete classifier can do better than a probabilistic classifier if i-iii are all met, but to get into it in more detail would mean getting into computational learning theory (which I am not particularly expert in) – Dikran Marsupial Aug 17 '21 at 16:46
  • Thanks! what do we mean by a reject option? – Abijah Aug 17 '21 at 21:45
    If the difference in probability is too small, for instance P(A|X) = 0.45; P(B|X) = 0.46 and P(C|X) = 0.09, then that means the classifier is pretty sure it isn't class C, but it is not very confident between whether it is class A or class B. Say those are diagnoses of similar cancers with different treatment regimes. In that case it might be better to choose not to classify at this stage and gather more information to make a more confident, better choice later, to give a better chance of picking the right treatment. It is difficult to do that with a deterministic classifier. – Dikran Marsupial Aug 18 '21 at 07:15
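The reject option described in that comment amounts to refusing to decide when the top two posterior probabilities are too close to call. A small sketch, with an arbitrary illustrative margin:

```python
def classify_with_reject(probs, margin=0.1):
    """Return the most probable class, or None (reject) when the top two
    classes are within `margin` of each other."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if probs[best] - probs[runner_up] < margin:
        return None  # defer the decision and gather more information
    return best

# The example from the comment above: pretty sure it isn't C,
# but A vs B is too close to call, so reject.
print(classify_with_reject({"A": 0.45, "B": 0.46, "C": 0.09}))   # None
print(classify_with_reject({"A": 0.90, "B": 0.05, "C": 0.05}))   # A
```

A deterministic classifier outputs only the winning label, so there is no margin to inspect and no natural way to implement this rule.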
  • That makes sense - Could you explain the effect of 2? Why does having varying class frequencies mean that a probabilistic model is better?/why does having fixed class frequencies make it less useful? – Abijah Aug 18 '21 at 09:45
    In Bayes' rule, the probability of class membership $P(C_i|X) \propto P(X|C_i)P(C_i)$, so it depends on the class frequency. This means the classifier will learn to assign probabilities of class membership for the class frequencies in the training set, and if the operational class frequencies are different, the probabilities will be wrong. If the operational frequencies are known and fixed, you can accommodate that by resampling the training data or weighting the training patterns. – Dikran Marsupial Aug 18 '21 at 09:49
  • For probabilistic classifiers, it can be done after training, see https://datascience.stackexchange.com/questions/93730/doesnt-over-undersampling-an-imbalanced-dataset-cause-issues/93853#93853 – Dikran Marsupial Aug 18 '21 at 09:49
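Concretely, the post-training correction divides out the training-set priors, multiplies by the operational priors, and renormalises. A sketch, assuming both sets of class frequencies are known (names and numbers are made up for illustration):

```python
def reprior(posteriors, train_priors, new_priors):
    """Correct posterior class probabilities for a change in class frequencies.
    By Bayes' rule P(C|x) is proportional to P(x|C)P(C), so replacing the
    training prior with the operational one and renormalising gives the
    posteriors the model would produce under the new frequencies."""
    unnorm = {c: posteriors[c] * new_priors[c] / train_priors[c] for c in posteriors}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Model trained on balanced classes, deployed where class "a" is 9x as common:
p = reprior({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1})
print(p)  # -> {'a': 0.9, 'b': 0.1}
```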
  • Could you explain the significance of 1? Isn't the cost function determined by the person training the model? – Abijah Aug 19 '21 at 09:32
    In e.g. a medical diagnosis test, the cost of classifying the patient as having cancer when they don't is fairly small, they will be worried, but the error will be spotted when further tests are carried out. The costs of a false negative - telling them they don't have cancer when they do is far higher as they may die or become very severely ill before the error is detected. So we should build these costs into the training criterion (or in setting the decision boundary for probabilistic classifiers) – Dikran Marsupial Aug 19 '21 at 09:35
  • @DikranMarsupial I'm really enjoying your inputs on this topic, but could you please clarify what is the intuition behind $P(X|C)$ for a classifier? What does "probability of observing a data point given a class" really mean? Where is the model in that? – hirschme May 10 '22 at 20:59
    @hirschme $P(X|C)$ describes how the data for patterns belonging to a particular class are distributed. In parametric classifiers (e.g. Naive Bayes or Gaussian classifiers), $P(X|C)$ appears explicitly in the model, but most modern classifiers are "non-parametric" which means that the likelihood is not explicitly written down in the model formulation. If we are just interested in classification, we don't need to model parts of $P(X|C)$ that don't tell us anything about the decision boundary, so non-parametric models can be simpler. – Dikran Marsupial May 11 '22 at 09:34
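To make that distinction concrete, here is a toy parametric classifier in which $P(X|C)$ appears explicitly: each class-conditional density is a one-dimensional Gaussian with its own mean and variance (the class names, parameters, and test point are invented for illustration):

```python
import math

def gaussian_pdf(x, mean, var):
    """The explicit likelihood P(x|C): a normal density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, params, priors):
    """P(C|x) proportional to P(x|C) P(C), renormalised over the classes."""
    unnorm = {c: gaussian_pdf(x, *params[c]) * priors[c] for c in params}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

params = {"healthy": (0.0, 1.0), "ill": (3.0, 1.0)}  # (mean, variance) per class
priors = {"healthy": 0.9, "ill": 0.1}
# x = 1.5 is equidistant from both means, so the likelihoods cancel and the
# posterior equals the prior -- showing exactly where P(x|C) enters the model.
print(posterior(1.5, params, priors))
```

A non-parametric discriminative model (an SVM, say) has no such `gaussian_pdf` term anywhere in its formulation.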
  • "most modern classifiers are "non-parametric" which means that the likelihood is not explicitly written down in the model formulation". Isn't any method based on maximum likelihood estimation based on likelihood? This includes regression (logistic regression, neural networks trained on cross-entropy, etc). All of these use the likelihood framed as $P(X,Y | Model)$ = "Probability or cost of seeing a data point (observation + outcome/label) given a model/parameterization". Is this not a likelihood? – hirschme May 11 '22 at 21:28
    @hirschme yes, but even if the likelihood is used in the training criterion, it may not be explicitly visible in the structure of the model. The likelihood is usually framed as $P(Y|X, Model)$, as the density of $X$ is only included via the sampling of the training patterns. Note that only $Y$ appears explicitly in the cross-entropy; $X$ only appears indirectly via the model output. – Dikran Marsupial May 12 '22 at 01:22