
Suppose I train a Naive Bayes classifier on a training data set and one attribute value ends up with an estimated probability of zero. How do I handle this when I later want to classify new data? The problem is that a single zero in the calculation makes the whole product zero, no matter how strongly the other attribute values point to a different class.

Example:

$P(x|spam=yes) = P(TimeZone = US | spam=yes) \cdot P(GeoLocation = EU | spam = yes) \cdot ~ ... ~ = 0.004 $

$P(x|spam=no) = P(TimeZone = US | spam=no) \cdot P(GeoLocation = EU | spam = no) \cdot ~ ... ~ = 0 $

The whole product becomes $0$ because, in our small training data set, every example with TimeZone = US has spam = yes, so $P(TimeZone = US | spam = no)$ is estimated as zero. How can I handle this? Should I use a bigger training data set, or is there another way to overcome this problem?

timleathart
fragant

1 Answer


One way to overcome this 'zero frequency problem' in a Bayesian setting is to add one to the count of every attribute value-class combination whenever some attribute value doesn't occur with every class value (this is known as add-one or Laplace smoothing). For example, say your training data looked like this:

$$\begin{array}{c|c|c|} & \text{Spam} = yes & \text{Spam} = no \\ \hline \text{TimeZone} = US & 10 & 5 \\ \hline \text{TimeZone} = EU & 0 & 0 \\ \hline \end{array}$$

$ P(\text{TimeZone} = US | \text{Spam} = yes) = \frac{10}{10} = 1$

$P(\text{TimeZone} = EU | \text{Spam} = yes) = \frac{0}{10} = 0$

Then you should add one to every value in this table when you're using it to calculate probabilities:

$$\begin{array}{c|c|c|} & \text{Spam} = yes & \text{Spam} = no \\ \hline \text{TimeZone} = US & 11 & 6 \\ \hline \text{TimeZone} = EU & 1 & 1 \\ \hline \end{array}$$

$ P(\text{TimeZone} = US | \text{Spam} = yes) = \frac{11}{12}$

$P(\text{TimeZone} = EU | \text{Spam} = yes) = \frac{1}{12}$
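The same calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `smoothed_conditionals`, the nested-dictionary layout of the counts, and the `alpha` parameter are hypothetical and not taken from any particular library. With `alpha=1` it reproduces the add-one table above.

```python
from fractions import Fraction


def smoothed_conditionals(counts, alpha=1):
    """Estimate P(value | class) with additive smoothing.

    counts maps class label -> {attribute value -> raw count}.
    alpha is the pseudo-count added to every cell (an int or Fraction);
    alpha=1 is the add-one case described above.
    """
    # Collect every attribute value seen for any class, so that
    # value/class pairs with no observations still get a cell.
    values = sorted({v for per_class in counts.values() for v in per_class})
    probs = {}
    for label, per_class in counts.items():
        # Smoothed class total: raw total plus alpha for each possible value.
        total = sum(per_class.get(v, 0) for v in values) + alpha * len(values)
        probs[label] = {
            v: Fraction(per_class.get(v, 0) + alpha, total) for v in values
        }
    return probs


# Raw counts from the first table.
timezone_counts = {
    "yes": {"US": 10, "EU": 0},
    "no": {"US": 5, "EU": 0},
}

probs = smoothed_conditionals(timezone_counts)
print(probs["yes"]["US"])  # 11/12
print(probs["yes"]["EU"])  # 1/12
```

Exact fractions are used here only so the output matches the hand calculation; floating-point division would work just as well in practice.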

timleathart
    Indeed. Note that sometimes you might add values other than one. For details see https://en.wikipedia.org/wiki/Additive_smoothing – DaL Dec 06 '16 at 07:46
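
For reference, the general form that the comment points to can be written as follows. The symbols are assumptions introduced here for illustration: $N_{v,c}$ is the number of training examples in class $c$ whose attribute takes value $v$, $N_c$ is the number of training examples in class $c$, $k$ is the number of distinct values the attribute can take, and $\alpha > 0$ is the pseudo-count.

$$\hat{P}(\text{attribute} = v \mid \text{class} = c) = \frac{N_{v,c} + \alpha}{N_c + \alpha k}$$

Setting $\alpha = 1$ and $k = 2$ recovers the add-one example from the answer: $\frac{10 + 1}{10 + 2} = \frac{11}{12}$.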