You may already be familiar with many versions of Bayes Rule:
p(yi|x) = p(x| yi)p(yi) / p(x) or
p(y|x) = p(x|y)p(y)/p(x) = p(x|y)p(y) / ∑i p(x| yi) p(yi) or
p(y = yi |x = xk) = p(x = xk |y = yi)p(y = yi) /∑ j p(x = xk|y = yj)p(y = yj)
Lets assume that y takes on ‘i’ distinct values or classes. If we know p(x|yi) and p(yi) then we could use Bayes rule to calculate:
p(yi|x) = p(x|yi)p(yi) / p(x) (1)
However, if x is represented as (x1,…….xk) then there are numerous combinations to consider for each class or value of y. This makes the computation of p(x|yi) computationally demanding. One approach would be to simplify this calculation by making a ‘naïve’ assumption of conditional independence:
Case where k = 2
p(x|y) = p(x1,x2|y) = p(x1|y) p(x2|y) = ∏ p(xk|yi) (2) 'conditional independence assumption'
the individual x-values are assumed to be independent, conditional on the value of y.
Under the conditional independence assumption, the number of calculations that need to be made for each value of y is decreased dramatically.
Substituting the calculation in (2) into (1) gives us the Naïve Bayes Classifier.
p(yi|x) = ∏ p(xk|yi) p(yi) / p(x)
Like ridge regression, the Naive Bayes Classifier is another example of a biased estimator that can outperform unbiased estimators given its lower variance. From 'Elements of Statistical Learning: data mining, inference and prediction:
" Despite these rather optimistic assumptions, naive Bayes classifiers often outperform far more sophisticated alternatives.....although the individual class density estimates may be biased, this bias might not hurt the posterior probabilities as much, especially near the decision regions. In fact, the problem may be able to withstand considerable bias for the savings in variance such a “naive” assumption earns"