How to handle a zero factor in Naive Bayes Classifier calculation?

Question

I have a training data set and train a Naive Bayes classifier on it, and one attribute value has probability zero. How do I handle this when I later want to classify new data? The problem is that a single zero factor makes the whole product zero, no matter how strongly the other attribute values might point to a different outcome.

Example:

$P(x|spam=yes) = P(TimeZone = US | spam=yes) \cdot P(GeoLocation = EU | spam = yes) \cdot ~ ... ~ = 0.004 $

$P(x|spam=no) = P(TimeZone = US | spam=no) \cdot P(GeoLocation = EU | spam = no) \cdot ~ ... ~ = 0 $

The whole product becomes $0$ because, in our small training data set, every example with TimeZone = US has spam = yes, so $P(TimeZone = US | spam = no) = 0$. How can I handle this? Should I use a bigger training data set, or is there another way to overcome this problem?

If you get a discrete attribute value occurring, its probability cannot be zero, by definition. — Paul, Dec 05 '16 at 18:58
Why do we add 1 in the zero-frequency problem? What is the logic behind this, and why not add another number? — Aftab Hussaiin, Dec 02 '17 at 21:00

score 14 · Accepted Answer · answered Dec 05 '16 at 22:28

One way to overcome this 'zero frequency problem' in a Bayesian setting is add-one (Laplace) smoothing: whenever some attribute value doesn’t occur with every class value, add one to the count of every attribute value-class combination. So, for example, say your training data looked like this:

$$\begin{array}{c|c|c|} & \text{Spam} = yes & \text{Spam} = no \\ \hline \text{TimeZone} = US & 10 & 5 \\ \hline \text{TimeZone} = EU & 0 & 0 \\ \hline \end{array}$$

$ P(\text{TimeZone} = US | \text{Spam} = yes) = \frac{10}{10} = 1$

$P(\text{TimeZone} = EU | \text{Spam} = yes) = \frac{0}{10} = 0$

Then you should add one to every value in this table when you're using it to calculate probabilities:

$$\begin{array}{c|c|c|} & \text{Spam} = yes & \text{Spam} = no \\ \hline \text{TimeZone} = US & 11 & 6 \\ \hline \text{TimeZone} = EU & 1 & 1 \\ \hline \end{array}$$

$ P(\text{TimeZone} = US | \text{Spam} = yes) = \frac{11}{12}$

$P(\text{TimeZone} = EU | \text{Spam} = yes) = \frac{1}{12}$
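The add-one calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `smoothed_probs` and the parameter `alpha` are hypothetical, not taken from any library, and the counts are the ones from the table above. `alpha` is assumed to be an integer or a `Fraction`; `alpha=1` gives add-one smoothing.

```python
from fractions import Fraction


def smoothed_probs(counts, alpha=1):
    """Estimate P(value | class) for one class with additive smoothing.

    counts: dict mapping each attribute value to its raw count in the class.
    alpha:  pseudo-count added to every value (hypothetical parameter name;
            assumed to be an int or Fraction so the arithmetic stays exact).
    """
    # Adding alpha to each of the len(counts) values grows the denominator
    # by alpha * len(counts), so the probabilities still sum to one.
    total = sum(counts.values()) + alpha * len(counts)
    return {value: Fraction(count + alpha, total) for value, count in counts.items()}


# Raw counts from the table above, one dict per class.
spam_yes = {"US": 10, "EU": 0}
spam_no = {"US": 5, "EU": 0}

p_yes = smoothed_probs(spam_yes)
p_no = smoothed_probs(spam_no)

print(p_yes["US"], p_yes["EU"])  # 11/12 1/12
print(p_no["US"], p_no["EU"])    # 6/7 1/7
```

Note that the denominator for Spam = yes is 12 rather than 10 because one pseudo-count is added for each of the two TimeZone values.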

Indeed. Note that sometimes you might add values other than one. For details see https://en.wikipedia.org/wiki/Additive_smoothing — DaL, Dec 06 '16 at 07:46
